Efficient Test-Time Scaling for Fact Checking with Large Language Models

Master Thesis (2026)
Author(s)

S. Prakash (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Anand – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
17-06-2026
Awarding Institution
Delft University of Technology
Programme
Computer Science, Data Science and Technology, Computer Science, Artificial Intelligence
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large language models can support automated fact-checking by reasoning over claims and evidence, but single generated explanations can be unreliable for numerical, temporal, or complex claims. Verifier-guided Best-of-N (BoN) decoding improves performance by sampling multiple reasoning traces at inference time and selecting the trace most strongly supported by a verifier. However, exhaustive BoN applies the same fixed inference budget to every claim, making it costly even when a stable prediction emerges early. This thesis proposes a joint generation-and-verification optimization framework for dynamically allocating test-time computation in verifier-guided fact-checking through two methods. Verifier-Guided Adaptive Stopping uses verifier feedback as a confidence signal to stop generation once the prediction appears stable. Surrogate-Guided Selective Verification further uses an online linear surrogate model to estimate trace utility and prioritize which traces are sent for verification. A fixed verifier budget experiment then examines whether surrogate-guided allocation uses limited verifier calls more effectively than generation-order or random selection.
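To make the adaptive-stopping idea concrete, the following is a minimal sketch, not the thesis implementation. It assumes a hypothetical `generate` callable that returns a `(label, trace)` pair and a hypothetical `verify` callable that returns a scalar score for a trace. The stopping rule used here (the best-scored label has not changed for `patience` consecutive samples) is an illustrative assumption; the abstract does not specify the actual stability criterion.

```python
from typing import Callable, Optional, Tuple


def adaptive_best_of_n(
    generate: Callable[[], Tuple[str, str]],  # hypothetical: returns (label, trace)
    verify: Callable[[str], float],           # hypothetical: returns a verifier score
    max_traces: int = 8,
    patience: int = 2,                        # assumed stability criterion
) -> Tuple[Optional[str], int]:
    """Sample traces one at a time and stop early once the prediction is stable.

    Returns the label of the best-scored trace and the number of traces used.
    """
    best_label: Optional[str] = None
    best_score = float("-inf")
    stable = 0  # consecutive samples that did not change the predicted label
    used = 0
    for used in range(1, max_traces + 1):
        label, trace = generate()
        score = verify(trace)
        if score > best_score:
            # A better-scored trace replaces the current best one.
            changed = label != best_label
            best_label, best_score = label, score
            stable = 0 if changed else stable + 1
        else:
            stable += 1
        if stable >= patience:
            break  # prediction looks stable; skip the remaining budget
    return best_label, used
```

With `patience` set to `max_traces` or higher, the loop never stops early and the function behaves like exhaustive BoN, which makes the two settings easy to compare on the same inputs.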
Experiments on QuanTemp and ClaimDecomp show that adaptive stopping reduces inference latency and estimated cost by 42.0% and 43.7%, respectively, while maintaining performance comparable to exhaustive BoN. Surrogate-guided verification reduces verifier calls by 44.5% and 53.7% relative to exhaustive BoN, with additional reductions of 9.4% and 4.0% over adaptive stopping. Under limited verifier budgets, surrogate-guided allocation often outperforms generation-order and random selection at lower budget levels across the evaluated settings, indicating that learned trace ordering can improve verification when verifier calls are constrained. Overall, adaptive generation and selective verification preserve much of the benefit of multi-trace reasoning while substantially reducing inference cost.
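The fixed-budget setting can be illustrated with a small sketch of surrogate-guided selection. This is an assumption-laden illustration, not the thesis code: the class `OnlineLinearSurrogate`, the function `select_with_budget`, the per-trace feature vectors, the learning rate, and the squared-error gradient update are all hypothetical choices. The abstract states only that an online linear surrogate estimates trace utility and orders traces for verification.

```python
from typing import Callable, List, Optional, Sequence, Tuple


class OnlineLinearSurrogate:
    """Hypothetical linear model updated by one gradient step per observation."""

    def __init__(self, n_features: int, lr: float = 0.1) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x: Sequence[float]) -> float:
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x: Sequence[float], y: float) -> None:
        # Gradient step on the squared error between prediction and verifier score.
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def select_with_budget(
    features: Sequence[Sequence[float]],  # assumed per-trace feature vectors
    verify: Callable[[int], float],       # hypothetical: verifier score for trace i
    surrogate: OnlineLinearSurrogate,
    budget: int,
) -> Tuple[Optional[int], int]:
    """Verify at most `budget` traces, highest predicted utility first.

    Returns the index of the best verified trace and the verifier calls made.
    """
    remaining = set(range(len(features)))
    best_idx: Optional[int] = None
    best_score = float("-inf")
    calls = 0
    while remaining and calls < budget:
        # Ties are broken by the lowest index, i.e. generation order.
        idx = max(sorted(remaining), key=lambda i: surrogate.predict(features[i]))
        remaining.remove(idx)
        score = verify(idx)
        calls += 1
        surrogate.update(features[idx], score)  # learn from each verifier call
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, calls
```

An untrained surrogate predicts the same utility for every trace, so the selection falls back to generation order until verifier feedback starts to separate the candidates.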
