Automated Benchmark Construction for Factual Question Answering over NHG Guidelines

A Foundation for RAG Evaluation in Dutch Primary Care

Bachelor Thesis (2026)
Author(s)

C.K. Bakker (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Yannick ter Heerdt – Mentor (Erasmus MC)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Evaluating Retrieval-Augmented Generation (RAG) systems in clinical domains requires reliable benchmarks, yet constructing them manually is costly and infeasible at scale. This paper presents an automated pipeline for constructing and evaluating a factual question answering benchmark over Dutch primary care guidelines. The pipeline generates question-answer pairs with a large language model using few-shot and chain-of-thought prompting, then filters them automatically using BERTScore grounding and round-trip consistency to retain high-quality pairs. Human validation confirmed that the final benchmark of 192 question-answer pairs across 10 Nederlands Huisartsen Genootschap (NHG) guidelines achieves factual correctness, retraceability and clinical relevance. The benchmark was integrated into a RAG pipeline to evaluate whether RAGChecker, a claim-level automated evaluation framework, could serve as a reliable alternative to human evaluation. RAGChecker scores were consistent with human judgment, though lower due to its strict claim-level checking. These results show that a reliable benchmark can be constructed automatically for Dutch primary care question answering and that RAGChecker serves as a reasonable but strict alternative to human evaluation of RAG systems in this domain.
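
The two automated filters named in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the names `QAPair`, `filter_pairs`, and `answer_fn` are hypothetical, the 0.8 thresholds are placeholders rather than values from the thesis, and simple token overlap stands in for BERTScore so that the example runs without model downloads. Grounding checks that the answer is supported by the source passage. Round-trip consistency checks that re-answering the question from the passage reproduces the generated answer.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str
    passage: str  # guideline text the pair was generated from


def _tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def token_precision(candidate: str, reference: str) -> float:
    """Share of candidate tokens that also occur in the reference.

    Assumption: a lexical stand-in for BERTScore, used only for illustration.
    """
    cand, ref = _tokens(candidate), _tokens(reference)
    if not cand:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return overlap / len(cand)


def token_f1(a: str, b: str) -> float:
    """Symmetric token-overlap F1 between two strings."""
    p, r = token_precision(a, b), token_precision(b, a)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def filter_pairs(
    pairs: List[QAPair],
    answer_fn: Callable[[str, str], str],  # hypothetical model call: (question, passage) -> answer
    grounding_threshold: float = 0.8,      # placeholder value, not from the thesis
    round_trip_threshold: float = 0.8,     # placeholder value, not from the thesis
    grounding_score: Callable[[str, str], float] = token_precision,
    agreement_score: Callable[[str, str], float] = token_f1,
) -> List[QAPair]:
    kept = []
    for pair in pairs:
        # Grounding: drop answers that the source passage does not support.
        if grounding_score(pair.answer, pair.passage) < grounding_threshold:
            continue
        # Round trip: re-answer the question and compare with the generated answer.
        regenerated = answer_fn(pair.question, pair.passage)
        if agreement_score(regenerated, pair.answer) < round_trip_threshold:
            continue
        kept.append(pair)
    return kept
```

Both scoring functions are passed in as parameters, so an embedding-based scorer such as BERTScore could replace the lexical stand-ins without changing the filtering logic.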
