Normalisation (normalization for Americans) attempts to remove score differentials caused by different point-scoring regimes (CW v phone), transmitter power level (high, low and QRP), and the availability of stations and multipliers in each geographical area (a function of propagation and proximity to population centres). Normalised scores should reflect operator ability as far as possible. This page describes the method employed in this attempt and discusses some of its limitations.
A critical underlying assumption in the score normalisation is that there is an equal proportion of serious, out-to-win contesters in each entry class and in each geographical area. Although I am sure there are contesters of this description in every category, some categories will attract more of these individuals than others (e.g. the multi-operator category will include a higher-than-average proportion of serious entries). I have therefore used only the top 10% of entries in each class to calculate the normalising factors; the assumption then becomes that the top 10% in each group comprise serious, out-to-win contesters.
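For the curious, the gist of the trimming step can be sketched in a few lines of Python. The exact calculation is not spelt out above, so this sketch assumes each class's factor is the ratio of the overall top-decile mean to that class's top-decile mean; that ratio, and the function names, are my illustrative assumptions, not necessarily the formula actually used:

```python
# Illustrative sketch only: assumes factor = overall top-10% mean divided
# by the class's top-10% mean. This is one plausible reading of the text,
# not necessarily the exact calculation behind the tables below.

def top_decile_mean(totals):
    """Mean of the top 10% of totals (at least one entry)."""
    ranked = sorted(totals, reverse=True)
    n = max(1, len(ranked) // 10)
    return sum(ranked[:n]) / n

def normalising_factors(totals_by_class):
    """totals_by_class maps class name -> list of QSO (or mult) totals."""
    overall = top_decile_mean(
        [t for totals in totals_by_class.values() for t in totals])
    return {cls: overall / top_decile_mean(totals)
            for cls, totals in totals_by_class.items()}
```

On this reading, a class whose top entries score below the overall top decile gets a factor greater than 1 (a boost), which matches the direction of the QRP and CW factors in the tables below.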
This is the methodology adopted in normalising the 1998 ARRL 10m contest scores:
| Mode | QSO Factor | Mult Factor |
| --- | --- | --- |
| A - Mixed | 1.2 | 0.8 |
| B - Phone | 1.4 | 1.3 |
| C - CW | 1.9 | 1.4 |
| D - Multi | 0.5 | 0.6 |

| Power | QSO Factor | Mult Factor |
| --- | --- | --- |
| A - QRP | 2.4 | 1.4 |
| B - Low | 1.4 | 1.2 |
| C - High | 0.7 | 0.9 |
Factors less than 1 mean QSOs or mults are reduced during normalisation; factors greater than 1 mean they are increased. These factors confirm that QSO totals are lowest for CW (that's why CW QSOs count double points!), higher for phone, and highest for mixed mode. Mixed mode obviously also has an advantage in multiplier availability. The power factors are clear too: both QSOs and multipliers are much easier to obtain running high power. The figures imply that for every 100 QSOs made by a QRP station, a low power station should make about 170 and a high power station about 340 to remain on par (all other things being equal).
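That parity arithmetic is easy to verify from the power-class QSO factors in the table, assuming a normalised QSO count is simply raw QSOs multiplied by the factor (which is how the factors read):

```python
# Power-class QSO factors from the table above.
qrp, low, high = 2.4, 1.4, 0.7

parity = 100 * qrp            # normalised credit for 100 QRP QSOs: 240
print(round(parity / low))    # -> 171 low-power QSOs for the same credit
print(round(parity / high))   # -> 343 high-power QSOs for the same credit
```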
The area factors include some unexplained patterns. In particular, I am at a loss to suggest why the factors for W2 are slightly but significantly higher than for W1 and W3. Indeed, W1-3 and W8-9 all have QSO factors greater than 1, suggesting the north-east USA was relatively impoverished in the availability of stations to work. This may in part be a function of latitude. It is not surprising that the southern states in W4 and W5 had a QSO-numbers advantage, but what about W7, W0 and VE, which also had better-than-average QSO totals? Maybe it's because these areas benefited from the sporadic-E openings I recall being reported. The western US was clearly less well off for multipliers than the east, presumably related to propagation access to Europe.
Europe is mostly quite high latitude, and we Europeans found QSOs harder to come by, as suggested by the surprisingly high QSO factor for Europe (perhaps due to the absence of sporadic E over Europe this time). But we did OK for multipliers. The southern continents (AF, SA) have good propagation to highly populated regions, hence the low QSO and multiplier factors for those areas. Asia is the worst-off continent, with poorer-than-average access to both QSOs and multipliers, while North America is the best off for both. I think this is what one would expect; on balance, the geographic normalisation works.
I know that many of the areas are large, with diverse population densities and/or different propagation characteristics, but the number of entries limits how finely they can be divided. I used the same areas as the ARRL does for listing scores, each of which happens to have a reasonable number of entries. Splitting into sub-areas might be an improvement if each sub-area contained enough entries. There might be scope for splitting VE and, for example, lumping VE1-3 with W1 and W2 (as suggested by Bob VE3SRE), as these areas have more-or-less common populations and propagation conditions.
I am pleased that this process highlights outstanding efforts that are otherwise overlooked; an example is the low power CW score of JA1PS. But I am also aware that the relative positions of stations can change under different normalising procedures. I suspect the chosen normalisation has been a little unfair to high-power entrants and to multi-op stations, perhaps because these classes have effectively been penalised for including a high proportion of contest-grade stations and top operators. Stations in North America may feel a little aggrieved by the severity of the penalties normalisation imposes on them, and I have already mentioned W2 stations as probably getting a more-than-fair boost.
Other variants on the above methodology that I tried (or might have tried) include standardisation (normalising using the difference from the mean divided by the standard deviation), log-transforming values before attempting normalisation, identifying outlier scores and removing them from the calculation of the normalising factors, other "trimming" strategies, and so on. I have not applied the statistical analysis I would employ if this were work and I were getting paid for it, and I am aware of the limitations this imposes. Improvements are possible, but I have other things to do.
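For completeness, the standardisation variant mentioned above amounts to converting each score to a z-score within its class. A minimal sketch, purely illustrative and not the method used for the tables above:

```python
import statistics

def standardise(scores):
    """Express each score as its distance from the class mean,
    in units of the class standard deviation (a z-score)."""
    mu = statistics.mean(scores)
    sigma = statistics.stdev(scores)
    return [(s - mu) / sigma for s in scores]

# e.g. standardise([120, 95, 250, 60]) gives class-relative z-scores
```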
The astute may have noticed that GØAEV's normalised relative placing is quite flattering, whereas his real score isn't. Yes, there is some personal motive for doing all this: as spelt out in the introduction, I wanted to see how well I was doing relative to everyone else, and the fact that I just missed a DX top-ten place in almost the largest entry class (low power phone, 11th out of 212) did spur me on! However, the normalisation method chosen was not influenced by any ulterior motives. I have tried to get the fairest result possible and hope I have identified the areas where the method is weakest. I would be interested to hear from anyone with constructive comments on the process, or on the benefits or otherwise of normalising scores, but please don't flame me because your relative score is not what you'd expect.