N.J. Ter Heerdt

Finding the most common failure modes of RAG systems with finetuning approaches

This study introduces a systematic, metric-driven failure taxonomy to identify and quantify errors across the document chunking, retrieval, and generation stages in retrieval-augmented generation (RAG) systems. We evaluate this framework on a benchmark derived from the Nederlandse Huisartsen Genootschap (NHG) protocols, comparing factual and clinical query settings. Our results show a substantial reduction in error-free performance when moving from factual tasks (137 error-free queries) to clinical scenarios (75 error-free queries). We further observe a shift in dominant failure modes: generation-level fabrications are most common in factual queries (14%), whereas clinical queries are dominated by missed retrievals (31%). Co-occurrence analysis reveals a strong association between retrieval failures and downstream generation errors, suggesting cascading effects across the pipeline. These findings highlight retrieval quality as the main bottleneck in clinical settings and motivate domain-specific retriever fine-tuning for safer deployment in Dutch primary care. ...
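As a rough illustration of the kind of bookkeeping such a failure taxonomy implies, the sketch below tallies error-free queries, per-mode failure rates, and pairwise co-occurrence of failure modes. It is a minimal sketch, not the study's implementation: the label names (`bad_chunking`, `missed_retrieval`, `fabrication`), the function `failure_summary`, and the toy data are all hypothetical and chosen only to mirror the chunking, retrieval, and generation stages named above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-query failure labels (toy data, NOT results from the study).
# Each set holds the failure modes observed for one query; an empty set means
# the query was error-free.
query_failures = [
    {"missed_retrieval", "fabrication"},
    {"missed_retrieval"},
    set(),
    {"fabrication"},
    {"missed_retrieval", "fabrication"},
    {"bad_chunking", "missed_retrieval"},
]


def failure_summary(labels_per_query):
    """Return (error-free count, per-mode rates, pairwise co-occurrence counts)."""
    n = len(labels_per_query)
    error_free = sum(1 for labels in labels_per_query if not labels)
    # How often each failure mode occurs across all queries.
    mode_counts = Counter(
        label for labels in labels_per_query for label in labels
    )
    # How often two failure modes appear together on the same query.
    # Sorting makes each pair key order-independent.
    pair_counts = Counter(
        pair
        for labels in labels_per_query
        for pair in combinations(sorted(labels), 2)
    )
    mode_rates = {mode: count / n for mode, count in mode_counts.items()}
    return error_free, mode_rates, pair_counts


error_free, mode_rates, pair_counts = failure_summary(query_failures)
print("error-free queries:", error_free)
print("failure rates:", mode_rates)
print("co-occurrences:", dict(pair_counts))
```

A high count for a retrieval/generation pair relative to the individual mode counts is the sort of signal that would point to cascading errors; a proper analysis would also need a baseline or association measure, which this sketch omits.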