A.H.C. Straathof
This thesis investigates the use of Large Language Models (LLMs) to automatically generate and evaluate synthetic clinical question-answer benchmarks based on Dutch NHG guidelines. The goal is to build a reliable and reproducible Key Feature Question (KFQ) dataset for testing clinical reasoning. In the first phase, different prompting strategies were tested using gpt-4o-mini on a subset of guideline text. The results show that while baseline model extraction is highly stable, a hybrid few-shot chain-of-thought strategy performs best, achieving the highest optimization score and strong factual grounding. With this prompting strategy, a final benchmark dataset of 375 fully traceable Dutch QA pairs was constructed.
In the second phase, the feasibility of automating the benchmark evaluation was tested by comparing the out-of-the-box frameworks RAGAS and RAGChecker directly against the grading of a licensed general practitioner. A significant gap was found between the automated tools and human expert judgment. RAGAS systematically overestimated safety because it relies on literal word overlap, causing it to completely miss dangerous clinical errors such as recommending a treatment that was explicitly stated to be failing. RAGChecker heavily penalized safe clinical paraphrasing and conditional reasoning because of its rigid token-level claim parsing. Ultimately, this work provides a functional pipeline for creating Dutch medical benchmarks, but it also shows that standard automated evaluation toolkits require custom, domain-specific calibration before they can reliably replace human expert judgment.