Efficient Test-Time Scaling for Fact Checking with Large Language Models

Master Thesis (2026)

Author(s)

S. Prakash (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Anand – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Large language models, Automated fact-checking, Test-time scaling, Verifier-guided inference, Adaptive inference, Selective verification

To reference this document use

https://resolver.tudelft.nl/uuid:a2672fab-be8e-44a3-9a1d-5191d15d8d5c

Publication Year

2026

Language

English

Graduation Date

17-06-2026

Awarding Institution

Delft University of Technology

Programme

Computer Science, Data Science and Technology, Computer Science, Artificial Intelligence

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large language models can support automated fact-checking by reasoning over claims and evidence, but single generated explanations can be unreliable for numerical, temporal, or complex claims. Verifier-guided Best-of-N (BoN) decoding improves performance by sampling multiple reasoning traces at inference time and selecting the trace most strongly supported by a verifier. However, exhaustive BoN applies the same fixed inference budget to every claim, making it costly even when a stable prediction emerges early. This thesis proposes a joint generation-and-verification optimization framework for dynamically allocating test-time computation in verifier-guided fact-checking through two methods. Verifier-Guided Adaptive Stopping uses verifier feedback as a confidence signal to stop generation once the prediction appears stable. Surrogate-Guided Selective Verification further uses an online linear surrogate model to estimate trace utility and prioritize which traces are sent for verification. A fixed verifier budget experiment then examines whether surrogate-guided allocation uses limited verifier calls more effectively than generation-order or random selection.
Experiments on QuanTemp and ClaimDecomp show that adaptive stopping reduces inference latency and estimated cost by 42.0% and 43.7%, respectively, while maintaining performance comparable to exhaustive BoN. Surrogate-guided verification reduces verifier calls by 44.5% and 53.7% relative to exhaustive BoN, with additional reductions of 9.4% and 4.0% over adaptive stopping. Under limited verifier budgets, surrogate-guided allocation often outperforms generation-order and random selection at lower budget levels across the evaluated settings, showing that learned trace ordering can improve verification when verifier calls are constrained. Overall, adaptive generation and selective verification preserve much of the benefit of multi-trace reasoning while substantially reducing inference cost.
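
The two methods named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the trace representation, the patience-based stopping rule, the feature vectors, and the learning rate are all assumptions made for illustration. The abstract only states that verifier feedback serves as a stopping signal and that an online linear surrogate ranks traces for verification.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Trace = Tuple[str, str]  # (predicted label, reasoning text); hypothetical shape


def adaptive_best_of_n(
    generate: Callable[[], Trace],
    verify: Callable[[Trace], float],
    max_n: int = 8,
    patience: int = 2,
) -> Tuple[Trace, int]:
    """Sample traces one at a time and stop once the best label stops changing.

    Assumed stopping rule: the label of the highest-scoring trace has stayed
    the same for `patience` consecutive samples.
    """
    if max_n < 1:
        raise ValueError("max_n must be at least 1")
    best_trace: Optional[Trace] = None
    best_score = float("-inf")
    stable = 0
    used = 0
    for used in range(1, max_n + 1):
        trace = generate()
        score = verify(trace)
        previous_label = best_trace[0] if best_trace is not None else None
        if score > best_score:
            best_trace, best_score = trace, score
        # Count consecutive samples that left the leading label unchanged.
        stable = stable + 1 if best_trace[0] == previous_label else 0
        if stable >= patience:
            break
    return best_trace, used


def verify_with_surrogate(
    features: Sequence[Sequence[float]],
    verify_at: Callable[[int], float],
    budget: int,
    lr: float = 0.1,
) -> Dict[int, float]:
    """Spend at most `budget` verifier calls, picking traces by predicted utility.

    `features[i]` is a hypothetical feature vector for trace i; the surrogate
    is a linear model trained online on the verifier scores seen so far.
    """
    weights = [0.0] * len(features[0])
    remaining = list(range(len(features)))
    observed: Dict[int, float] = {}

    def predict(i: int) -> float:
        return sum(w * x for w, x in zip(weights, features[i]))

    while remaining and len(observed) < budget:
        idx = max(remaining, key=predict)  # ties fall back to generation order
        remaining.remove(idx)
        score = verify_at(idx)
        error = score - predict(idx)
        # One stochastic-gradient step on the squared prediction error.
        weights = [w + lr * error * x for w, x in zip(weights, features[idx])]
        observed[idx] = score
    return observed
```

In this sketch, `adaptive_best_of_n` returns the selected trace together with the number of traces actually generated, so the saving relative to a fixed budget of `max_n` can be read off directly. `verify_with_surrogate` returns only the scores of the traces it chose to verify, which is the quantity a fixed verifier budget constrains.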

Files

Master_Thesis_Report_Sowmya.pd... (pdf)

(pdf | 4.56 Mb)

License info not available