Diagnosing Failure Patterns in Large Language Models
A Symptom–Sign Framework and Integrated Toolkit for Practitioners
J.S. Beekman (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. Yang – Mentor (TU Delft - Web Information Systems)
L. Corti – Mentor (TU Delft - Web Information Systems)
K.W. Song – Graduation committee member (TU Delft - Knowledge and Intelligence Design)
Abstract
Large language models (LLMs) are increasingly deployed in consequential settings, yet their failures remain challenging to understand. Unlike traditional software bugs, these failures emerge from distributed, context-dependent interactions that resist straightforward debugging. While XAI methods can surface signals about individual predictions, they do not directly support the hypothesis-driven investigative process that characterizes diagnosis in practice: forming expectations, gathering evidence, and identifying recurring failure patterns.
This thesis addresses this gap by introducing (1) a diagnostic framework that structures diagnosis through symptoms (observed undesirable outputs), signs (evidence from interpretability methods), and failure patterns (recurring, explainable combinations of symptoms and signs), complemented by a Should-Know/Really-Know lens that distinguishes task expectation from actual model knowledge; and (2) a prototype diagnostic toolkit that operationalizes this framework through integrated evaluation, run comparison, and state externalization.
An evaluation study with eight practitioners using codebook thematic analysis reveals three core findings. First, practitioners universally adopt a baseline-first strategy, building diagnostic confidence through initial evaluation before deeper probing. Second, they triangulate across samples, metrics, and interpretability outputs rather than relying on single signals, using comparison as a central sense-making operation for hypothesis testing. Third, diagnostic depth is systematically gated by three factors: interpretation friction (insufficient guidance on what methods reveal and how to act on their outputs), missing workflow glue (the absence of affordances for iterative refinement), and execution constraints (opaque platform limits that disrupt sustained diagnostic progress).
These findings reframe diagnostic tooling as infrastructure for iterative, hypothesis-driven reasoning rather than a collection of isolated analytical methods. Effective diagnostic support must scaffold the full investigative cycle: from expectation formation and baseline calibration through evidence triangulation, hypothesis testing via comparison, and state externalization. This scaffolding must also account for the gating factors that limit the depth of diagnostic progress. This reframing positions diagnosis as a knowledge and workflow challenge, with implications for tooling design, framework development, and empirical research into practitioner diagnostic workflows.