N.J. Ter Heerdt
Failure analysis of RAG in healthcare
Finding the most common failure modes of RAG systems with finetuning approaches
This study introduces a systematic, metric-driven failure taxonomy to identify and quantify errors across the document chunking, retrieval, and generation stages in retrieval-augmented generation (RAG) systems. We evaluate this framework on a benchmark derived from the Nederlandse Huisartsen Genootschap (NHG) protocols, comparing factual and clinical query settings. Our results show a substantial reduction in error-free performance when moving from factual tasks (137 error-free queries) to clinical scenarios (75 error-free queries). We further observe a shift in dominant failure modes: generation-level fabrications are most common in factual queries (14%), whereas clinical queries are dominated by missed retrievals (31%). Co-occurrence analysis reveals a strong association between retrieval failures and downstream generation errors, suggesting cascading effects across the pipeline. These findings highlight retrieval quality as the main bottleneck in clinical settings and motivate domain-specific retriever fine-tuning for safer deployment in Dutch primary care.
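The per-mode rates and the co-occurrence analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each query has been annotated with a set of failure labels; the label names (`missed_retrieval`, `fabrication`), the function names, and the toy data are hypothetical and are not taken from the study or its benchmark.

```python
from collections import Counter
from itertools import combinations


def error_free(labels_per_query):
    """Count queries that carry no failure label at all."""
    return sum(1 for labels in labels_per_query if not labels)


def failure_rates(labels_per_query):
    """Fraction of queries on which each failure label occurs."""
    n = len(labels_per_query)
    counts = Counter(label for labels in labels_per_query for label in set(labels))
    return {label: count / n for label, count in counts.items()}


def co_occurrence(labels_per_query):
    """Count how often each unordered pair of labels appears on the same query."""
    pairs = Counter()
    for labels in labels_per_query:
        # Sort so every pair is stored under one canonical key.
        for a, b in combinations(sorted(set(labels)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical toy annotations, one set of failure labels per query.
toy = [
    set(),
    {"missed_retrieval", "fabrication"},
    {"missed_retrieval"},
    {"fabrication"},
    {"missed_retrieval", "fabrication"},
]

print(error_free(toy))
print(failure_rates(toy))
print(co_occurrence(toy))
```

A high pair count relative to the individual label counts is the kind of signal that would suggest retrieval failures cascading into generation errors; a real analysis would likely add a significance test or a normalized association measure on top of these raw counts.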