Benchmarking Open-Source vs. Closed-Source LLMs for Dutch Medical Guidelines

Quantitative Evaluation of Retrieval-Augmented Generation using the NHG-Guidelines

Bachelor Thesis (2026)
Author(s)

A. Tageldin (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Yannick ter Heerdt – Mentor (Erasmus MC)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) are promising tools for clinical decision support. However, using closed-source models for this purpose requires sending sensitive patient data to external servers, which creates serious GDPR compliance and privacy risks. Open-source models can instead be hosted locally and securely, but their clinical reasoning capabilities in non-English medical contexts remain unproven. This research quantitatively benchmarks three closed-source and three open-source LLMs on the Dutch NHG-guidelines. Using a standardized RAG pipeline and an automated "LLM-as-a-Judge" evaluation framework (RAGChecker), we analyze the trade-offs between clinical accuracy, computational cost, and inference speed. The results show that top-tier open-source models (specifically DeepSeek V4 Pro) outperform closed-source models (GPT-5.5) in both clinical accuracy and speed at a fraction of the cost, making them a viable, privacy-preserving alternative for Dutch healthcare institutions.

Files

Final_Research_Paper_5_.pdf
(pdf | 0.977 Mb)
License info not available