Title: How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?
Author: Urbano, Julián (TU Delft Multimedia Computing); Corsi, M. (TU Delft Multimedia Computing); Hanjalic, A. (TU Delft Intelligent Systems)
Department: Intelligent Systems
Date: 2021
Abstract: Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and recommendations. A definitive answer to this question has recently been attempted via stochastic simulation of IR evaluation data, which allows researchers to compute actual Type I error rates because they can control the null hypothesis. One such research line simulates metric scores for a fixed set of systems on random topics, and concluded that the t-test behaves best. Another such line simulates retrieval runs by random systems on a fixed set of topics, and concluded that the Wilcoxon test behaves best. Interestingly, two recent surveys of the IR literature have shown that the community has a clear preference for precisely these two tests, so further investigation is critical to understand why the above simulation studies reach opposite conclusions. It has recently been postulated that one reason for the disagreement is the distributions of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems.
Following a principled approach, we analyze the robustness of statistical tests to different factors, thus identifying under what conditions they behave well or not with respect to the Type I error rate. Our results suggest that differences between the Wilcoxon and t-test may be explained by the skewness of score differences.
Subject: simulation, skewness, statistical significance, Type I error
To reference this document use: http://resolver.tudelft.nl/uuid:a3f14e6d-7147-4113-9e34-db31bc2ae418
DOI: https://doi.org/10.1145/3471158.3472242
Publisher: Association for Computing Machinery (ACM), New York
ISBN: 978-1-4503-8611-1
Source: ICTIR 2021: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval
Event: 11th ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2021, 2021-07-11, Virtual, Online, Canada
Part of collection: Institutional Repository
Document type: conference paper
Rights: © 2021 Julián Urbano, M. Corsi, A. Hanjalic
Files: PDF 3471158.3472242.pdf (2.01 MB)