by Fritzlein » Wed Dec 01, 2010 5:42 am
Clustering can indeed happen, but when ratings are calculated with BayesElo it isn't a problem. It is true that a group of identical bots could become tightly linked to each other by their results, but that means that when a bot within the group loses to a bot outside the group, the whole group moves down together. Similarly when a bot within the group beats a bot outside the group, the whole group moves up together. In short, what happens to a group of identical bots is basically what should happen to that group.
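To see this concretely, here is a rough Python sketch of fitting ratings by maximum likelihood on the standard Elo win-probability model. It is not BayesElo itself (which adds a prior and handles draws), and the bots and game counts are invented for illustration: bots 0, 1, and 2 are a tightly linked group, and bot 3 stands in for the rest of the field.

    def expected(ra, rb):
        # standard Elo expectation that a player rated ra beats one rated rb
        return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

    def fit(games, n=4, steps=3000, lr=10.0):
        # crude gradient ascent on the Elo log-likelihood
        r = [0.0] * n
        for _ in range(steps):
            g = [0.0] * n
            for w, l in games:          # (winner, loser) pairs
                p = expected(r[w], r[l])
                g[w] += 1 - p
                g[l] -= 1 - p
            r = [ri + lr * gi for ri, gi in zip(r, g)]
            m = sum(r) / n
            r = [ri - m for ri in r]    # pin the average rating to zero
        return [round(ri) for ri in r]

    # Bots 0-2 play each other heavily with balanced results.
    intra = [(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0)] * 10
    print(fit(intra + [(0, 3), (3, 0)]))           # balanced cross results
    print(fit(intra + [(0, 3), (3, 0), (3, 0)]))   # bot 0 loses once more to bot 3

The first fit comes out near [0, 0, 0, 0]; in the second, one extra loss by bot 0 drags bots 1 and 2 down by the same roughly 30 points, even though they never played bot 3. That is the "whole group moves together" behavior.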
The least random, most accurate ranking would be created if every bot played every other bot on every map in the set. If this full round-robin were played, then there could be no concern about intransitivity conflicting with Elo ratings, since the Elo ratings would rank players exactly the same way as round-robin won-loss records. Unfortunately, resource constraints make a full round-robin impossible, so the decision before us is what subset of games to play.
The first set of games we want to throw out is mismatches, because they give very little information per game. Yes, it is true that an 800-point rating gap doesn't always mean a 100:1 chance of winning. However, if the true chance of winning is 97:4 instead, i.e. an implied 554-point Elo gap, we have played 101 games to discover a score difference of only 3 victories.
Contrast this to a close match where the ratings are supposedly only 118 points apart, for an implied winning ratio of 67:34. In the previous example our rating estimate was off by 246 points, and that changed the score by only 3 victories. But here, if the estimate is off by only 59 Elo, i.e. if the true gap is 59 points, that changes the score by 8 victories, to 59:42. Over the same 101 games, a rating error a quarter the size shows up as nearly three times the change in score. That's good efficiency.
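For anyone who wants to check the arithmetic, this short Python snippet recovers the implied Elo gaps from those records. The formula is just the standard Elo expected-score relation, nothing specific to this tournament:

    import math

    def gap(p):
        # Elo gap implied by an expected score p for the stronger side
        return 400.0 * math.log10(p / (1.0 - p))

    for wins, losses in [(100, 1), (97, 4), (67, 34), (59, 42)]:
        print(wins, losses, round(gap(wins / (wins + losses))))
    # -> gaps of about 800, 554, 118, and 59 points respectively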
We all know that actual game results will not be transitive in the way that Elo ratings imply, but to get the most bang for the buck in detecting that intransitivity, we should not play a lot of mismatches. Instead we should play lots of close matches to make the most efficient use of server resources.
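One way to make "information per game" precise, though I didn't spell it out above, is to note that both the statistical information a single game carries about the rating gap and the sensitivity of the score to a rating error are proportional to p*(1-p), where p is the win probability. A quick sketch:

    def expected(gap):
        # Elo expected score for the stronger player at a given gap
        return 1.0 / (1.0 + 10 ** (-gap / 400.0))

    for d in [0, 118, 300, 554, 800]:
        p = expected(d)
        print(d, round(p, 3), round(p * (1 - p), 4))
    # p*(1-p) falls from 0.25 at an even match to about 0.01 at an
    # 800-point gap, so one game between near-equals is worth roughly
    # 25 games of an extreme mismatch.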
The second set of games we want to throw out is repeated pairings. It might turn out that for whatever reason a lower-rated bot scores well against a particular higher-rated bot. When this pairing occurs, it is a little bit of bad luck for the higher-rated bot and a little bit of good luck for the lower-rated bot. In a round-robin format, this kind of good and bad luck will even out overall, but if we allow repeated pairings before all bots have had a chance to play all other bots, we will amplify the luck inherent in opponent selection, thus making the final rankings more random. That's inefficiency.
Unfortunately, these two principles of efficient pairing are contradictory. If we avoid repeat pairings, we will be forced to have big mismatches. If we avoid big mismatches, we will be forced to have repeat pairings.
My opinion is that taking an extreme stance in either direction is suboptimal. The most efficient use of limited server resources will be to find some sort of balance. For example, we could absolutely avoid repeat pairings until that avoidance forced us into mismatches of 500 Elo or more. If (and only if) the only non-repeat pairings available are mismatches of more than 500 Elo, then repeat pairings should be allowed.
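In code, that compromise might look something like the sketch below. The 500-point cutoff is just my example figure, and the ratings and game history are hypothetical:

    # Prefer the closest-rated fresh opponent; fall back to a repeat
    # pairing only when every fresh opponent is 500+ Elo points away.
    def pick_opponent(bot, ratings, played):
        others = [b for b in ratings if b != bot]
        fresh = [b for b in others if frozenset((bot, b)) not in played]
        near_fresh = [b for b in fresh if abs(ratings[b] - ratings[bot]) < 500]
        pool = near_fresh or others   # repeats allowed only as a last resort
        return min(pool, key=lambda b: abs(ratings[b] - ratings[bot]))

    ratings = {"A": 1900, "B": 1850, "C": 1790, "D": 1200}
    played = {frozenset(("A", "B")), frozenset(("A", "C"))}
    print(pick_opponent("A", ratings, played))
    # D is the only fresh opponent but is 700 points away,
    # so A gets a repeat pairing with B instead.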
My speculation on efficient pairing is quite separate from the issue of focusing server resources on the higher-ranked bots later in the tournament. The gradually shrinking field that has been implemented is a good idea independent of exactly how the remaining players in the field are paired.
Finally, as Janzert said, this is a very hard measurement problem. The fact that places #2 through #5 are churning doesn't mean the tournament is doing a bad job. Given two bots that are in truth 20 Elo points apart, you can let them play 100 games, and the better one is only expected to be ahead 53:47. That's only 0.6 standard deviations from equal. Or, to put it another way, the wrong bot will be ahead about 27% of the time even after 100 games. Obviously the more games you play, the better the estimate becomes, and the more the ratings settle down. You won't see nearly as much churning on the final day, after the top bots already have hundreds of games each.
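Those figures are easy to verify with the exact binomial distribution. The snippet below puts the chance of the wrong bot being strictly ahead after 100 games at about 25%, in the same ballpark as the rough normal-approximation figure above (the exact tail depends on how a 50-50 tie is counted):

    import math

    # win probability for the better of two bots 20 Elo points apart
    p = 1.0 / (1.0 + 10 ** (-20 / 400.0))
    n = 100
    # exact binomial probability that the better bot wins 49 games or fewer,
    # i.e. that the wrong bot is strictly ahead after 100 games
    behind = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2))
    print(round(n * p), round(behind, 2))   # -> 53 expected wins, behind ~ 0.25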