Failure analysis of RAG in healthcare

finding the most common failure modes of RAG systems with finetuning approaches

Bachelor Thesis (2026)

Author(s)

N.I. Apawti (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N.J. Ter Heerdt – Mentor (TU Delft - Industrial Design Engineering)

P.K. Murukannaiah – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Retrieval-Augmented Generation (RAG), Failure Taxonomy, Healthcare Informatics

To reference this document use

https://resolver.tudelft.nl/uuid:1bb49381-50af-4a34-b329-504fd6bf7232

More Info

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This study introduces a systematic, metric-driven failure taxonomy to identify and quantify errors across the document chunking, retrieval, and generation stages in retrieval-augmented generation (RAG) systems. We evaluate this framework on a benchmark derived from the Nederlandse Huisartsen Genootschap (NHG) protocols, comparing factual and clinical query settings. Our results show a substantial reduction in error-free performance when moving from factual tasks (137 error-free queries) to clinical scenarios (75 error-free queries). We further observe a shift in dominant failure modes: generation-level fabrications are most common in factual queries (14%), whereas clinical queries are dominated by missed retrievals (31%). Co-occurrence analysis reveals a strong association between retrieval failures and downstream generation errors, suggesting cascading effects across the pipeline. These findings highlight retrieval quality as the main bottleneck in clinical settings and motivate domain-specific retriever fine-tuning for safer deployment in Dutch primary care.

Files

THESIS_PROJECTO_CSE_3000_10_.p... (pdf)

(pdf | 0.376 MB)

License info not available