Exploring Copula-Based Models for the Stochastic Simulation of Information Retrieval Evaluation Data

Abstract

In the field of Information Retrieval (IR), the reliable evaluation of systems is key to advancing the state of the art. Much IR research focuses on optimizing the various aspects of evaluation. Stochastic simulation is one technique that can assist this kind of research. It allows researchers to overcome certain limitations of IR data, such as limited size and lack of control. Recently, two parallel lines of work have used stochastic simulation to study the question "which statistical significance test is optimal for IR evaluation data?". Surprisingly, the authors reach different conclusions, even though both rely on stochastic simulation. One line of work, led by Urbano et al., simulates scores for a fixed set of systems on new random topics, and concluded that the t-test is optimal. The other, led by Parapar et al., simulates new random retrieval runs for a fixed set of topics, and concluded that the Wilcoxon test is optimal. Interestingly, these two tests are the most popular in the IR literature. To shed some light on this disagreement, we made a first attempt at providing empirical evidence on the quality of the simulation approach used by Urbano et al. Our main finding is that the quality of the simulation is moderately good; we also identified some opportunities to refine it. In addition, we proposed a new model selection criterion that showed promising results and, in many cases, selected better models than more established criteria such as AIC.
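
To make the idea of simulating scores for fixed systems on new random topics more concrete, the following is a minimal sketch of copula-based simulation for two systems. It is not the model studied in this work: the Gaussian copula, the empirical marginals, the function name `simulate_topic_scores`, the correlation value, and the toy scores are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def simulate_topic_scores(scores_a, scores_b, rho, n_topics, seed=0):
    """Draw n_topics new (score_a, score_b) pairs.

    Dependence between the two systems comes from a Gaussian copula with
    correlation rho (an assumed choice); each marginal is the empirical
    distribution of the observed per-topic scores.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    sorted_a, sorted_b = sorted(scores_a), sorted(scores_b)

    def empirical_quantile(sorted_scores, u):
        # Inverse empirical CDF: map u in [0, 1] to one of the observed scores.
        index = min(int(u * len(sorted_scores)), len(sorted_scores) - 1)
        return sorted_scores[index]

    simulated = []
    for _ in range(n_topics):
        # Correlated standard normals for the two systems.
        z_a = rng.gauss(0.0, 1.0)
        z_b = rho * z_a + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # Transform to uniforms (the copula sample), then to the marginals.
        u_a, u_b = std_normal.cdf(z_a), std_normal.cdf(z_b)
        simulated.append((empirical_quantile(sorted_a, u_a),
                          empirical_quantile(sorted_b, u_b)))
    return simulated


# Hypothetical per-topic effectiveness scores for two systems (made-up values).
observed_a = [0.12, 0.35, 0.41, 0.58, 0.63, 0.77]
observed_b = [0.10, 0.30, 0.44, 0.52, 0.70, 0.81]
new_topics = simulate_topic_scores(observed_a, observed_b, rho=0.8, n_topics=50)
```

A paired significance test could then be applied to simulated pairs such as `new_topics`, and repeating this over many draws gives the kind of controlled, arbitrarily large data that real test collections cannot provide.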