Investigating Scoping Errors in Open-Domain Numerical Reasoning

Master Thesis (2026)
Author(s)

M.C. Smink (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Anand – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S.E. Verwer – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
24-06-2026
Awarding Institution
Delft University of Technology
Programme
Computer Science, Data Science and Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Numerical information shapes how people interpret real-world events, evaluate claims, and make decisions. As Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly used for open-domain information seeking, their ability to reason reliably over quantities has become critical. Yet even when models retrieve relevant evidence and produce fluent chain-of-thought reasoning, they often still reach incorrect conclusions because they fail to interpret numbers in the right context. This work investigates numerical scoping errors, an underexplored reasoning failure in which a model binds a quantity to the wrong referent, time frame, unit, population, condition, or source. In high-impact settings, a single mis-scoped number can change the meaning of a conclusion, turning a statistic into a misleading financial signal, an unsafe medical recommendation, or a false public claim. This makes numerical scoping errors an important reasoning failure to investigate. Across numerical fact-checking (FC) and question-answering (QA) tasks, this work identifies scoping errors as a common and consequential failure mode in LLM-generated reasoning traces. In human analysis, scoping errors appeared in 44.7% of numerical FC traces and 26.8% of numerical QA traces, with 35.3% and 81.8% of these errors, respectively, judged to contribute to incorrect final answers. To detect these failures at scale, this work develops a hybrid LLM-as-a-Judge pipeline that decomposes reasoning traces into quantities, verifies whether each quantity is correctly scoped based on its source type, and aggregates these quantity-level judgements into trace-level labels. The pipeline achieves moderate agreement with human annotations but can be expensive to run. To explore more efficient alternatives, the hybrid pipeline's scoping detection signal is distilled into smaller, conservative student models that process traces around 212 times faster.
Finally, this work evaluates whether scoping-aware reward signals can improve downstream performance in a parallel test-time scaling setting, where the system chooses among multiple generated reasoning traces. These signals produce modest improvements, especially when combined with correctness-based selection. Overall, this work shows that numerical reasoning failures are not merely errors of calculation but errors of context. By making these contextual failures visible and measurable, this work takes a step toward numerical reasoning systems that can support high-impact, real-world decisions not only because they calculate correctly, but because they understand what their numbers mean.
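
The two mechanisms named in the abstract, aggregating quantity-level judgements into a trace-level label and choosing among several generated traces, can be illustrated with a minimal sketch. This is not the thesis implementation. The names `QuantityJudgement`, `Trace`, `has_scoping_error`, and `select_answer` are hypothetical, the "any mis-scoped quantity flags the trace" rule is an assumed aggregation, and majority voting over final answers stands in for whatever correctness-based selection the thesis actually uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuantityJudgement:
    quantity: str           # quantity mention extracted from a reasoning trace
    correctly_scoped: bool  # verdict from a (hypothetical) judge model


@dataclass
class Trace:
    answer: str                          # final answer produced by the trace
    judgements: list[QuantityJudgement]  # one judgement per extracted quantity


def has_scoping_error(trace: Trace) -> bool:
    """Assumed aggregation rule: one mis-scoped quantity flags the whole trace."""
    return any(not j.correctly_scoped for j in trace.judgements)


def select_answer(traces: list[Trace]) -> str:
    """Drop flagged traces, then majority-vote over the remaining answers.

    If every trace is flagged, fall back to voting over all of them.
    """
    clean = [t for t in traces if not has_scoping_error(t)]
    pool = clean or traces
    votes = Counter(t.answer for t in pool)
    return votes.most_common(1)[0][0]


# Illustrative data only: two traces bind a figure to the wrong year.
bad = QuantityJudgement("12% (bound to 2019 instead of 2020)", False)
good = QuantityJudgement("12% (2020)", True)
traces = [
    Trace("False", [bad]),
    Trace("False", [good, bad]),
    Trace("True", [good]),
]
selected = select_answer(traces)
fallback = select_answer(traces[:2])
```

In this toy example a plain majority vote would pick "False", while filtering on the scoping signal first leaves only the clean trace and picks "True". When every candidate is flagged, the sketch falls back to an unfiltered vote rather than returning nothing.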

Files

MCSmink_TUDelft_Thesis.pdf
(pdf | 2.98 Mb)
License info not available