How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?

Author

Urbano, Julián (TU Delft Multimedia Computing)
Corsi, M. (TU Delft Multimedia Computing)
Hanjalic, A. (TU Delft Intelligent Systems)

Department

Intelligent Systems

Date

2021

Abstract

Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and recommendations. A definitive answer to this question has recently been attempted via stochastic simulation of IR evaluation data, which allows researchers to compute actual Type I error rates because they can control the null hypothesis. One such research line simulates metric scores for a fixed set of systems on random topics, and concluded that the t-test behaves best. Another such line simulates retrieval runs by random systems on a fixed set of topics, and concluded that the Wilcoxon test behaves best. Interestingly, two recent surveys of the IR literature have shown that the community has a clear preference for precisely these two tests, so further investigation is critical to understand why the above simulation studies reach opposite conclusions. It has recently been postulated that one reason for the disagreement is the distribution of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems. Following a principled approach, we analyze the robustness of statistical tests to different factors, thus identifying the conditions under which they do or do not behave well with respect to the Type I error rate. Our results suggest that differences between the Wilcoxon and t-test may be explained by the skewness of score differences.
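
The idea of estimating a Type I error rate by simulation can be sketched with a toy example. This is not the simulation method of the paper or of the studies it discusses: it assumes per-topic score differences drawn independently from a hand-picked zero-mean distribution, 50 topics, a hard-coded 5% critical value for the t-test, and the normal approximation for the Wilcoxon signed-rank test. All names (`type_i_error_rates`, `symmetric`, `skewed`) are hypothetical. The null hypothesis assumed here is that the mean difference is zero, so every rejection counts as a Type I error. Note that a skewed zero-mean distribution is not symmetric about zero, which is the condition the Wilcoxon test actually examines.

```python
import math
import random
import statistics

N_TOPICS = 50         # assumed number of topics per simulated experiment
T_CRIT_DF49 = 2.0096  # two-sided 5% critical value of Student's t, 49 d.o.f.
Z_CRIT = 1.96         # two-sided 5% critical value of the standard normal


def t_test_rejects(diffs):
    """Paired t-test on per-topic differences (valid only for len == N_TOPICS)."""
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return abs(t) > T_CRIT_DF49


def wilcoxon_rejects(diffs):
    """Wilcoxon signed-rank test using the large-sample normal approximation."""
    n = len(diffs)
    # Rank by absolute value; continuous draws make ties and zeros negligible.
    ranked = sorted(diffs, key=abs)
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return abs((w_plus - mean_w) / sd_w) > Z_CRIT


def type_i_error_rates(draw_diff, n_trials=2000, seed=0):
    """Fraction of simulated experiments in which each test rejects the null."""
    rng = random.Random(seed)
    t_rejections = w_rejections = 0
    for _ in range(n_trials):
        diffs = [draw_diff(rng) for _ in range(N_TOPICS)]
        t_rejections += t_test_rejects(diffs)
        w_rejections += wilcoxon_rejects(diffs)
    return t_rejections / n_trials, w_rejections / n_trials


def symmetric(rng):
    """Zero-mean, symmetric differences (standard normal)."""
    return rng.gauss(0.0, 1.0)


def skewed(rng):
    """Zero-mean, right-skewed differences (exponential shifted by its mean)."""
    return rng.expovariate(1.0) - 1.0


if __name__ == "__main__":
    for name, sampler in [("symmetric", symmetric), ("skewed", skewed)]:
        t_rate, w_rate = type_i_error_rates(sampler)
        print(f"{name}: t-test {t_rate:.3f}, Wilcoxon {w_rate:.3f}")
```

Comparing the two printed rows against the nominal 5% level shows how each test responds to skewness in this toy setting. Real IR metric scores are bounded, often discrete, and correlated across systems, none of which this sketch models.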

Subject

simulation
skewness
statistical significance
type I error

To reference this document use:

http://resolver.tudelft.nl/uuid:a3f14e6d-7147-4113-9e34-db31bc2ae418

DOI

https://doi.org/10.1145/3471158.3472242

Publisher

Association for Computing Machinery (ACM), New York

ISBN

978-1-4503-8611-1

Source

ICTIR 2021: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval

Event

11th ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2021, 2021-07-11, Virtual, Online, Canada

Part of collection

Institutional Repository

Document type

conference paper

Rights

© 2021 Julián Urbano, M. Corsi, A. Hanjalic

Files

PDF

3471158.3472242.pdf

2.01 MB
