Automated Benchmark Construction for Factual Question Answering over NHG Guidelines

A Foundation for RAG Evaluation in Dutch Primary Care

Bachelor Thesis (2026)

Author(s)

C.K. Bakker (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Yannick ter Heerdt – Mentor (Erasmus MC)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Benchmark NHG RAG

To reference this document use

https://resolver.tudelft.nl/uuid:134dc0fe-7f7a-4e66-a2e4-dcbc7f0473c4

More Info

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Evaluating Retrieval-Augmented Generation (RAG) systems in clinical domains requires reliable benchmarks, yet constructing these manually is costly and infeasible at scale. This paper presents an automated pipeline for constructing and evaluating a factual question answering benchmark over Dutch primary care guidelines. The pipeline uses large language model-based question-answer generation with few-shot and chain-of-thought prompting, combined with automated filtering using BERTScore grounding and round-trip consistency, to produce high-quality question-answer pairs. Human validation confirmed that the final benchmark of 192 question-answer pairs across 10 Nederlands Huisartsen Genootschap (NHG) guidelines achieves factual correctness, retraceability and clinical relevance. The benchmark was integrated into a RAG pipeline to evaluate whether RAGChecker, a claim-level automated evaluation framework, could serve as a reliable alternative to human evaluation. RAGChecker scores were consistent with human judgment, though lower due to its strict claim-level checking. These results show that a reliable benchmark can be constructed automatically for Dutch primary care question answering and that RAGChecker serves as a reasonable but strict alternative to human evaluation of RAG systems in this domain.

Files

Final_Thesis_Paper.pdf

(pdf | 0.346 MB)

License info not available