Benchmarking Open-Source vs. Closed-Source LLMs for Dutch Medical Guidelines
Quantitative Evaluation of Retrieval-Augmented Generation using the NHG-Guidelines
A. Tageldin (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Yannick ter Heerdt – Mentor (Erasmus MC)
P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Abstract
Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) are promising tools for clinical decision support. However, using closed-source models for this purpose requires transmitting sensitive patient data to external servers, which creates severe GDPR compliance and privacy risks. Open-source models, by contrast, can be hosted securely on local infrastructure, but their clinical reasoning capabilities in non-English medical contexts remain unproven. This research quantitatively benchmarks three closed-source and three open-source LLMs operating over the Dutch NHG-guidelines. Using a standardized RAG pipeline and an automated "LLM-as-a-Judge" evaluation framework (RAGChecker), we quantify the trade-offs between clinical accuracy, computational cost, and inference speed. The results show that top-tier open-source models (specifically DeepSeek V4 Pro) outperform closed-source models (GPT-5.5) in clinical accuracy and speed at a fraction of the cost, making them a viable, privacy-preserving alternative for Dutch healthcare institutions.
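To make the "LLM-as-a-Judge" scoring concrete, the following is a minimal sketch of how claim-level accuracy scores of the kind RAGChecker reports (precision, recall, F1) can be aggregated once a judge model has labeled each claim. It does not call the RAGChecker library and is not the thesis's actual evaluation code; the class `ClaimJudgments`, its field names, and the example verdicts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClaimJudgments:
    """Hypothetical container for a judge model's verdicts on one question."""
    # One flag per claim extracted from the model's answer:
    # True if the reference (guideline-based) answer supports that claim.
    response_claims_supported: list[bool]
    # One flag per claim extracted from the reference answer:
    # True if the model's answer covers that claim.
    reference_claims_covered: list[bool]


def claim_scores(j: ClaimJudgments) -> dict[str, float]:
    """Aggregate claim-level verdicts into precision, recall, and F1."""
    n_resp = len(j.response_claims_supported)
    n_ref = len(j.reference_claims_covered)
    # Precision: share of the answer's claims that are correct.
    precision = sum(j.response_claims_supported) / n_resp if n_resp else 0.0
    # Recall: share of the reference claims that the answer contains.
    recall = sum(j.reference_claims_covered) / n_ref if n_ref else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative verdicts only (not results from the thesis):
# 3 of 4 answer claims are supported, 1 of 2 reference claims is covered.
example = ClaimJudgments(
    response_claims_supported=[True, True, False, True],
    reference_claims_covered=[True, False],
)
scores = claim_scores(example)
print(scores)
```

Averaging these per-question scores over a fixed question set, with the same retriever and prompts for every model, is one way to compare models on accuracy while cost and latency are measured separately.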