In a prediction tournament, contestants are tasked with predicting the distribution of a random variable. To determine which contestant makes the most accurate predictions, scores are assigned based on the outcomes of the random variables. The scoring rules are designed such that a contestant's expected score decreases as their predictions approach the true distribution. This implies that the contestant with the lowest score should be the most accurate predictor. However, simulation results show that this is not the case. In this report, we found that for the common case of Bernoulli random variables, the true success probabilities affect the distribution of winners: the effect is positive when the probabilities are close to 0 or 1, and negative when they are near 0.5. We also found that this distribution is not affected by whether contestant errors are drawn from a continuous distribution with fixed variance σ² or are simply +σ or −σ. Furthermore, contestants who make extreme predictions (always predicting 0 or 1) do not outperform those who predict values close to the true success probability. While the choice of scoring rule does influence the distribution of winners, it does not eliminate the paradox. We found that the Pseudospherical and Power scores with parameter β close to 1, and the Logarithmic score, performed the best. We then extend our analysis to random variables with multiple categories. To support this extension, we introduce a new sampling method that builds on the one used in the earlier simulations. In the binary model, one success probability per random variable was enough; with multiple categories, each random variable needs several probabilities, and these must sum to exactly 1. Using a statistical distance, we determine how to model contestant predictions. For these random variables, we also analyze various scoring rules.
In the multi-category case, we found that the Pseudospherical and Power scores with β slightly larger than 1, and the Logarithmic score, performed the best across various numbers of categories. Similarly, we extend our analysis to continuous random variables. Because of time constraints, we only consider Normal distributions with known variance. We use the same statistical distance as for the multi-category random variables, the total variation distance, to determine how to model contestant predictions. Again comparing several scoring rules, we found that the Power and Pseudospherical scores with β close to 1, and the Logarithmic score, performed the best in this scenario.
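
To make the binary setup concrete, the following is a minimal sketch of one way such a tournament could be simulated. It is not the simulation code used in this report. The function names, the number of contestants, events and tournaments, the error sizes, the uniform range for the true success probabilities, and the clipping of predictions to (0, 1) are all assumptions chosen for illustration. Scores are negatively oriented, so the contestant with the lowest total wins, and each contestant's prediction is the true probability plus or minus a fixed error σ.

```python
import math
import random


def log_score(p, outcome):
    """Negatively oriented Logarithmic score: lower is better."""
    return -math.log(p if outcome == 1 else 1.0 - p)


def brier_score(p, outcome):
    """Quadratic (Brier) score for a binary outcome: lower is better."""
    return (p - outcome) ** 2


def clip(p, eps=1e-6):
    """Keep predictions strictly inside (0, 1); an assumption of this sketch."""
    return min(max(p, eps), 1.0 - eps)


def run_tournament(sigmas, n_events, score, rng):
    """Simulate one tournament and return the index of the winner."""
    totals = [0.0] * len(sigmas)
    for _ in range(n_events):
        q = rng.uniform(0.05, 0.95)            # assumed range of true probabilities
        outcome = 1 if rng.random() < q else 0  # Bernoulli(q) outcome
        for i, sigma in enumerate(sigmas):
            error = sigma if rng.random() < 0.5 else -sigma  # +sigma or -sigma
            totals[i] += score(clip(q + error), outcome)
    # The contestant with the lowest total score wins.
    return min(range(len(sigmas)), key=totals.__getitem__)


def win_frequencies(sigmas, n_events, n_tournaments, score, seed=0):
    """Estimate how often each contestant wins over many tournaments."""
    rng = random.Random(seed)
    wins = [0] * len(sigmas)
    for _ in range(n_tournaments):
        wins[run_tournament(sigmas, n_events, score, rng)] += 1
    return [w / n_tournaments for w in wins]


# Hypothetical contestants, ordered from most to least accurate.
sigmas = [0.02, 0.05, 0.10, 0.20]
freqs = win_frequencies(sigmas, n_events=20, n_tournaments=2000, score=log_score)
print(freqs)
```

The first entry of `freqs` estimates how often the most accurate contestant wins. Passing `brier_score` instead of `log_score` gives a rough way to compare how the choice of scoring rule shifts the distribution of winners under these assumptions.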