Generating and Evaluating an Automated Dutch Clinical QA Benchmark Grounded in the NHG Guidelines

Bachelor Thesis (2026)
Author(s)

A.H.C. Straathof (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Yannick ter Heerdt – Mentor

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Downloads counter
6
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis investigates the use of Large Language Models (LLMs) to automatically generate and evaluate synthetic clinical question-answer benchmarks based on Dutch NHG guidelines. The goal is to build a reliable and reproducible Key Feature Question (KFQ) dataset for testing clinical reasoning. In the first phase, different prompting strategies were tested using gpt-4o-mini across a subset of guideline text. The results show that while baseline model extraction is highly stable, a hybrid few shot chain of thought strategy performs best, achieving the highest optimization score and strong factual grounding. With this prompting a strategy a final benchmark dataset of 375 fully traceable Dutch QA pairs was constructed.
In the second phase, the feasibility of automating the benchmark evaluation was tested by comparing out-of-the-box frameworks RAGAS and RAGChecker directly against the grading of a licensed general practitioner. A significant judgement gap was found between the automated tools and human expert judgment. RAGAS systematically overestimated safety because it relies on literal word overlap, making it completely miss dangerous clinical errors like recommending a treatment that was explicitly stated to be failing. RAGChecker heavily penalized safe clinical paraphrasing and conditional reasoning due to its rigid token-level claim parsing. Ultimately, this work provides a functional pipeline for creating Dutch medical benchmarks, but highlights that standard automated evaluation toolkits require custom, domain-specific calibration before they can reliably replace human expert judgment.

Files

Thesis_6008968_2026_v3.pdf
(pdf | 0.712 Mb)
License info not available