Failure analysis of RAG in healthcare

finding the most common failure modes of RAG systems with finetuning approaches

Bachelor Thesis (2026)
Author(s)

N.I. Apawti (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N.J. Ter Heerdt – Mentor (TU Delft - Industrial Design Engineering)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This study introduces a systematic, metric-driven failure taxonomy to identify and quantify errors across the document chunking, retrieval, and generation stages of retrieval-augmented generation (RAG) systems. We evaluate this framework on a benchmark derived from the protocols of the Nederlands Huisartsen Genootschap (NHG, the Dutch College of General Practitioners), comparing factual and clinical query settings. The number of error-free queries drops substantially when moving from factual tasks (137 error-free queries) to clinical scenarios (75 error-free queries). We further observe a shift in dominant failure modes: generation-level fabrications are most common in factual queries (14%), whereas clinical queries are dominated by missed retrievals (31%). Co-occurrence analysis reveals a strong association between retrieval failures and downstream generation errors, suggesting cascading effects across the pipeline. These findings highlight retrieval quality as the main bottleneck in clinical settings and motivate domain-specific retriever fine-tuning for safer deployment in Dutch primary care.
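
The kind of co-occurrence analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each query has already been annotated with a set of failure labels; the label names, the toy data, and the helper functions below are hypothetical and are not taken from the thesis or its benchmark.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-query failure labels (illustrative only, not thesis data).
# An empty set marks an error-free query.
queries = [
    {"missed_retrieval", "fabrication"},
    {"missed_retrieval"},
    set(),
    {"fabrication"},
    {"missed_retrieval", "fabrication"},
]


def failure_rates(labels):
    """Fraction of queries that show each failure mode."""
    n = len(labels)
    counts = Counter(f for q in labels for f in q)
    return {f: c / n for f, c in counts.items()}


def cooccurrence(labels):
    """Count how often each pair of failure modes appears in the same query."""
    pairs = Counter()
    for q in labels:
        for a, b in combinations(sorted(q), 2):
            pairs[(a, b)] += 1
    return pairs


def conditional_rate(labels, given, target):
    """Share of queries with `given` that also show `target` (None if no such query)."""
    with_given = [q for q in labels if given in q]
    if not with_given:
        return None
    return sum(target in q for q in with_given) / len(with_given)


error_free = sum(1 for q in queries if not q)
print("error-free queries:", error_free)
print("failure rates:", failure_rates(queries))
print("pair counts:", dict(cooccurrence(queries)))
print("P(fabrication | missed_retrieval):",
      conditional_rate(queries, "missed_retrieval", "fabrication"))
```

A high conditional rate of a generation error given a retrieval error would be consistent with a cascading effect, although a raw co-occurrence count on its own does not establish causation.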
