by fourmidable » Wed Dec 21, 2011 1:11 am
I believe the fairest way to proceed with the ranking and the cutoff of the bots is to use mu instead of skill.
Let me explain why this makes sense. First I'd like to recap a little on the TrueSkill algorithm for those who are unfamiliar with it. I got a pretty good look at how it works as part of Kaggle's Deloitte/FIDE Chess Rating Challenge. TrueSkill is essentially a Bayesian estimator of a player's strength. TrueSkill is superior to simpler rating systems like Elo in two ways: it has provision for teams and multi-player games, and it converges faster than Elo. A large part of the advantage of TrueSkill is that it tracks both the strength (mu) and the uncertainty (sigma) of each participant. The player's strength is modeled as a Gaussian (normal) distribution with mean mu and standard deviation sigma.
In Ants, mu is initialized to 50 and sigma to 50/3 ~= 16.67. From a Bayesian perspective, this is the prior distribution. After each game, TrueSkill compares the outcome of the game with the expected result and computes an updated mu/sigma posterior for each player's strength. The research paper reports that good convergence typically occurs within about 20 games.
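To make the update step concrete, here is a minimal sketch of the simplified two-player, no-draw TrueSkill update with the Ants priors. This is my own illustration, not the actual server code: the real algorithm handles multi-player games via a factor graph, and the beta value here (sigma/2, a common default) is an assumption.

```python
from statistics import NormalDist
from math import sqrt

# Ants priors: mu = 50, sigma = 50/3. beta = sigma/2 is a common
# default for the performance noise; not confirmed for the Ants server.
MU0, SIGMA0 = 50.0, 50.0 / 3.0
BETA = SIGMA0 / 2.0
phi = NormalDist()  # standard normal

def update(winner, loser):
    """One simplified TrueSkill update for a two-player, no-draw game.

    winner/loser are (mu, sigma) tuples; returns the updated tuples.
    """
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    c = sqrt(2 * BETA ** 2 + s_w ** 2 + s_l ** 2)
    t = (mu_w - mu_l) / c
    v = phi.pdf(t) / phi.cdf(t)   # mean shift of the truncated Gaussian
    w = v * (v + t)               # variance shrink factor
    new_w = (mu_w + s_w ** 2 / c * v,
             s_w * sqrt(max(1 - s_w ** 2 / c ** 2 * w, 0.0)))
    new_l = (mu_l - s_l ** 2 / c * v,
             s_l * sqrt(max(1 - s_l ** 2 / c ** 2 * w, 0.0)))
    return new_w, new_l

a, b = (MU0, SIGMA0), (MU0, SIGMA0)
a, b = update(a, b)  # a beats b: a's mu rises, b's falls, both sigmas shrink
print(a, b)
```

Note how a single game already moves mu by several points and shrinks sigma; that shrinking sigma is exactly what drives the 'skill' dynamics discussed below.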
Now, the Ants leaderboard is not sorted by the program strength estimate (mu), but instead by a metric called 'skill', defined as mu - 3*sigma. Notice how a new submission gets a skill of zero. As games are played, the uncertainty in strength decreases, reducing sigma, and thus 'skill' typically increases (unless mu drops precipitously). Skill is nice for maintaining a leaderboard: rough initial mu estimates don't suddenly appear on top, and a program is forced to rise slowly through the ranks over time. However, for the finals, this property is undesirable.
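A quick check makes the formula concrete (pure arithmetic, nothing assumed beyond the stated priors):

```python
def skill(mu, sigma):
    """The leaderboard metric: a conservative lower bound on strength."""
    return mu - 3 * sigma

# A brand-new submission starts at zero:
print(skill(50, 50 / 3))   # essentially 0 (up to float rounding)

# After many games sigma shrinks; skill rises even if mu never moves:
print(skill(50, 5))        # -> 35.0
```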
The pairing algorithm favors adding players with low sigma to other players' games. As more games are played, sigma drops further and even more games are played, improving 'skill' but not 'mu' as much. It would be unfair to compare 'skill' between players that have played different numbers of games; comparing 'mu' is much fairer. Remember, 'mu' is really the statistically most probable estimate of a player's strength, and it is much less sensitive to the number of games played. There is still a small bias introduced by the priors, negative for top players and positive for bottom players, but this effect should be small once about 10 games have been played. Skill is an artificial measure designed to hold down new submissions; it essentially says there is a ~99.87% chance (the one-sided 3-sigma bound) that the player's true strength is above this value.
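Where that 3-sigma confidence figure comes from: assuming the Gaussian strength model, the probability that the true strength lies above mu - 3*sigma is the one-sided normal tail, which can be checked directly:

```python
from statistics import NormalDist

# P(strength > mu - 3*sigma) under a Gaussian model is Phi(3),
# the one-sided 3-sigma bound:
p = NormalDist().cdf(3)
print(round(p, 4))   # -> 0.9987, i.e. ~99.87%
```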
For these reasons I suggest switching to a 'mu'-based ranking for the finals.