Benchmarking Open-Source vs. Closed-Source LLMs for Dutch Medical Guidelines

Quantitative Evaluation of Retrieval-Augmented Generation using the NHG-Guidelines

Bachelor Thesis (2026)

Author(s)

A. Tageldin (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Yannick ter Heerdt – Mentor (Erasmus MC)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Large Language Models, GDPR, Medical AI, NHG, RAG, Retrieval-Augmented Generation, Data Privacy, Clinical Decision Support, LLM Benchmarking, Open-Source AI, Dutch Medical Guidelines, LLM-as-a-Judge

To reference this document use

https://resolver.tudelft.nl/uuid:c7f46725-06ec-4759-bfeb-52e4d5f70bf0

More Info

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) offer promising clinical decision support. However, using closed-source models for this purpose requires transmitting sensitive patient data to external servers, which creates serious GDPR compliance and privacy risks. Open-source models, by contrast, can be hosted locally and securely, but their clinical reasoning capabilities in non-English medical contexts remain unproven. This research quantitatively benchmarks three closed-source and three open-source LLMs operating over the Dutch NHG-guidelines. Using a standardized RAG pipeline and an automated "LLM-as-a-Judge" evaluation framework (RAGChecker), we analyze the trade-offs between clinical accuracy, computational cost, and inference speed. The results show that top-tier open-source models (specifically DeepSeek V4 Pro) not only match but outperform closed-source models (GPT-5.5) in clinical accuracy and speed at a fraction of the cost, offering a viable, privacy-preserving alternative for Dutch healthcare institutions.
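As a rough illustration of what a claim-level "LLM-as-a-Judge" score can look like, the sketch below computes precision, recall, and F1 from per-claim verdicts. It is a minimal sketch under stated assumptions, not the thesis's pipeline and not the RAGChecker API: the `ClaimJudgement` structure, the `claim_scores` function, and the example verdicts are all hypothetical, and the judge model that would produce the verdicts is left out.

```python
from dataclasses import dataclass


@dataclass
class ClaimJudgement:
    """Hypothetical per-question verdicts produced by a judge model."""
    # For each claim in the model's answer: is it supported by the reference answer?
    response_claims_supported: list[bool]
    # For each claim in the reference answer: is it covered by the model's answer?
    reference_claims_covered: list[bool]


def claim_scores(j: ClaimJudgement) -> tuple[float, float, float]:
    """Return (precision, recall, f1) over claims; empty claim lists score 0.0."""
    n_resp = len(j.response_claims_supported)
    n_ref = len(j.reference_claims_covered)
    precision = sum(j.response_claims_supported) / n_resp if n_resp else 0.0
    recall = sum(j.reference_claims_covered) / n_ref if n_ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative verdicts only (not data from the thesis):
# 3 of 4 answer claims are supported, 1 of 2 reference claims is covered.
example = ClaimJudgement(
    response_claims_supported=[True, True, False, True],
    reference_claims_covered=[True, False],
)
precision, recall, f1 = claim_scores(example)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Averaging such per-question scores across a question set, alongside measured cost and latency per query, is one way the accuracy, cost, and speed trade-off described above could be tabulated per model.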

Files

Final_Research_Paper_5_.pdf

(pdf | 0.977 Mb)

License info not available