Evaluating Metric Sensitivity to Offline–Online Alignment in Information Retrieval

Bachelor Thesis (2026)
Author(s)

S. Udagawa (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Anand – Mentor (TU Delft - Web Information Systems)

J. Urbano Merino – Mentor (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
30-01-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This study examines how effectively widely used offline information retrieval (IR) metrics reflect changes in online performance. As offline evaluation plays a central role in model development, understanding its alignment with user-oriented signals is essential. Using 52 diverse ranking pipelines and approximately 2,000 queries from the MS MARCO DL19 and DL20 benchmarks, we analyze the sensitivity of five offline metrics (Precision@10, Recall@10, MAP, MRR, and NDCG@10) to five simulated online metrics (CTR, SSR, ZRR, ADT, and SAR). Sensitivity is quantified through slope-based analysis, and alignment is assessed using the Pearson correlation coefficient. Our results show that NDCG@10 and Recall@10 are the most sensitive offline metrics across multiple online behaviors, while Precision@10 consistently exhibits low sensitivity. Furthermore, we demonstrate that sensitivity and alignment capture complementary aspects of offline–online relationships: some metric pairs show strong responsiveness but weak linear consistency. Overall, this study provides a detailed and reproducible evaluation of how offline metrics behave in relation to simulated online performance, offering practical guidance for selecting offline metrics that better reflect user-centric outcomes.

https://github.com/AinzOoalGown123/Metric-Sensitivity-Analysis
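The slope-and-correlation approach described in the abstract can be sketched in a few lines. This is a minimal illustration, not code from the linked repository: the metric values below are invented placeholders, and the exact fitting procedure used in the thesis may differ.

```python
import numpy as np

# Hypothetical example: offline NDCG@10 scores and a simulated online
# metric (e.g. CTR) for a handful of ranking pipelines. These values are
# illustrative only and do not come from the study's data.
ndcg_at_10 = np.array([0.42, 0.48, 0.51, 0.55, 0.60, 0.63])
ctr = np.array([0.11, 0.13, 0.12, 0.16, 0.18, 0.19])

# Sensitivity: the slope of a least-squares linear fit of the online
# metric against the offline metric, i.e. how much the online signal
# moves per unit change in the offline score.
slope, intercept = np.polyfit(ndcg_at_10, ctr, deg=1)

# Alignment: the Pearson correlation coefficient between the two
# metric series, measuring the strength of their linear relationship.
pearson_r = np.corrcoef(ndcg_at_10, ctr)[0, 1]

print(f"slope = {slope:.3f}, r = {pearson_r:.3f}")
```

A metric pair can score high on one quantity and low on the other: a steep slope with scattered points gives high sensitivity but weak alignment, which is the complementarity the abstract highlights.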

Files

CSE3000_Final_Paper.pdf
(pdf | 4.59 MB)
License info not available