M. Corsi
Please Note
3 records found
1
Rank-Biased Overlap (RBO) is a popular measure of the similarity between two rankings. A key characteristic of RBO is that it can be computed even when the rankings are not fully seen and only a prefix is known, but this introduces uncertainty in the computation. In such cases, one would normally compute the point estimate RBOEXT, as well as bounds representing the best and worst cases; their difference is thus a residual quantifying the amount of uncertainty. Another source of uncertainty is the presence of tied items, because their actual relative order is unknown. Current approaches to this issue similarly provide a point estimate by considering the average RBO score over all the permutations of the ties, such as RBOa. However, there is currently no approach to quantify and bound the uncertainty due to ties, just as there is for the uncertainty due to unseen items. In this paper we fill this gap and provide algorithmic solutions to the problem of finding the arrangements of tied items that yield the lowest and highest possible RBO scores, naturally leading to total bounds and residuals. We also show that the current RBOa estimate only equals the average RBO over permutations when the rankings have the same length, so we also generalize it to rankings of different lengths. In summary, this work provides a full account for the uncertainty in RBO, allowing practitioners to make more sensible decisions on the grounds of rank similarity. The main realization is that residuals can actually be much larger once we account for both sources of uncertainty. To illustrate this, we present empirical results using both synthetic and TREC data, demonstrating that a realistic picture for the residual of RBO can only be provided by considering both sources of uncertainty.
Rank-Biased Overlap (RBO) is a similarity measure for indefinite rankings: it is top-weighted, and can be computed when only a prefix of the rankings is known or when they have only some items in common. It is widely used for instance to analyze differences between search engines by comparing the rankings of documents they retrieve for the same queries. In these situations, though, it is very frequent to find tied documents that have the same score. Unfortunately, the treatment of ties in RBO remains superficial and incomplete, in the sense that it is not clear how to calculate it from the ranking prefixes only. In addition, the existing way of dealing with ties is very different from the one traditionally followed in the field of Statistics, most notably found in rank correlation coefficients such as Kendall's and Spearman's. In this paper we propose a generalized formulation for RBO to handle ties, thanks to which we complete the original definitions by showing how to perform prefix evaluation. We also use it to fully develop two variants that align with the ones found in the Statistics literature: one when there is a reference ranking to compare to, and one when there is not. Overall, these three variants provide researchers with flexibility when comparing rankings with RBO, by clearly determining what ties mean, and how they should be treated. Finally, using both synthetic and TREC data, we demonstrate the use of these new tie-aware RBO measures. We show that the scores may differ substantially from the original tie-unaware RBO measure, where ties had to be broken at random or by arbitrary criteria such as by document ID. Overall, these results evidence the need for a proper account of ties in rank similarity measures such as RBO.
Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and recommendations. Definitive answer to this question has recently been attempted via stochastic simulation of IR evaluation data, allowing researchers to compute actual Type I error rates because they can control the null hypothesis. One such research line simulates metric scores for a fixed set of systems on random topics, and concluded that the t-test behaves the best. Another such line simulates retrieval runs by random systems on a fixed set of topics, and concluded that the Wilcoxon test behaves the best. Interestingly, two recent surveys of the IR literature have shown that the community has a clear preference precisely for these two tests, so further investigation is critical to understand why the above simulation studies reach opposite conclusions. It has been recently postulated that a reason for the disagreement is the distributions of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems. Following a principled approach, we analyze the robustness of statistical tests to different factors, thus identifying under what conditions they behave well or not with respect to the Type I error rate. Our results suggest that differences between the Wilcoxon and t-test may be explained by the skewness of score differences.