Open-domain question answering (ODQA) often requires integrating evidence from multiple sources and reasoning across several steps. While recent work has made progress on retrieval and reasoning independently, their combined optimization remains challenging. Standard retrieval methods may fail to surface all relevant documents, while reasoning models can generate confident but incorrect answers when evidence is incomplete, noisy, or inconsistent. This limits both accuracy and the reliability of model predictions, particularly for multi-hop and compositional questions.
This thesis explores strategies to jointly enhance retrieval and reasoning for ODQA. On the retrieval side, we introduce a dynamic answer frontier mechanism that prioritizes candidate documents based on semantic consistency across multiple generated answers, guiding iterative document expansion over a retrieval graph. This consistency-driven approach improves recall by promoting documents aligned with the most reliable reasoning traces. On the reasoning side, we apply test-time scaling (TTS), generating multiple candidate answers per question and training a verifier model to select the most trustworthy one. The verifier evaluates both semantic correctness and grounding in retrieved evidence, mitigating the effects of misleading or irrelevant documents.
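The two mechanisms above can be illustrated with a minimal sketch. The function and variable names (`expand_frontier`, `consistency_score`, `select_answer`, `neighbors`, `answers_for`) are hypothetical, and exact-match agreement stands in for the semantic consistency measure; this is an assumption for illustration, not the thesis implementation.

```python
from collections import Counter

def consistency_score(answers):
    # Fraction of sampled answers agreeing with the majority answer:
    # a simple exact-match proxy for semantic consistency (illustrative only).
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def expand_frontier(frontier, neighbors, answers_for, k=3):
    # Rank candidate documents on the retrieval graph by the consistency
    # of answers generated when each candidate is added to the context,
    # then keep the top-k as the next frontier (hypothetical interface).
    candidates = {n for doc in frontier for n in neighbors(doc)}
    scored = sorted(
        ((consistency_score(answers_for(doc)), doc) for doc in candidates),
        reverse=True,
    )
    return [doc for _, doc in scored[:k]]

def select_answer(candidates, verifier):
    # Test-time scaling: sample several candidate answers and let a
    # trained verifier score each one, returning the most trustworthy.
    return max(candidates, key=verifier)
```

In this sketch, `answers_for` would invoke the reasoning model with the candidate document added to the context, and `verifier` would be the trained model scoring both semantic correctness and grounding in the retrieved evidence.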
We evaluate the proposed approach on two challenging ODQA benchmarks, MuSiQue and 2WikiMultiHopQA, which require complex multi-hop reasoning and resist shortcut solutions. Experimental results show that our method improves evidence recall and downstream answer accuracy over strong baselines, including standard retrieval pipelines and semantic uncertainty-based re-ranking methods. Qualitative analysis reveals better handling of compositional queries, including temporal comparisons and multi-hop relational reasoning, along with improved resilience to noisy retrievals and reduced divergence from relevant evidence. The study also identifies remaining challenges, such as reliance on agreement as a proxy for correctness and the computational cost of TTS, pointing to future directions involving principled uncertainty measures, end-to-end feedback integration, and efficiency improvements for exploring larger reasoning spaces. Together, these findings underscore the value of integrating semantic consistency-driven retrieval with verifier-guided reasoning selection to advance robustness and trustworthiness in complex ODQA systems.