Diagnosing Failure Patterns in Large Language Models
A Symptom–Sign Framework and Integrated Toolkit for Practitioners
J.S. Beekman (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. Yang – Mentor (TU Delft - Web Information Systems)
L. Corti – Mentor (TU Delft - Web Information Systems)
K.W. Song – Graduation committee member (TU Delft - Knowledge and Intelligence Design)
Abstract
Large language models (LLMs) are increasingly deployed in consequential settings, yet their failures remain challenging to understand. Unlike traditional software bugs, these failures emerge from distributed, context-dependent interactions that resist straightforward debugging. While XAI methods can surface signals about individual predictions, they do not directly support the hypothesis-driven investigative process that characterizes diagnosis in practice: forming expectations, gathering evidence, and identifying recurring failure patterns.
This thesis addresses this gap by introducing (1) a diagnostic framework that structures diagnosis through symptoms (observed undesirable outputs), signs (evidence from interpretability methods), and failure patterns (recurring, explainable combinations of symptoms and signs), complemented by a Should-Know/Really-Know lens that distinguishes task expectation from actual model knowledge; and (2) a prototype diagnostic toolkit that operationalizes this framework through integrated evaluation, run comparison, and state externalization.
An evaluation study with eight practitioners using codebook thematic analysis reveals three core findings. First, practitioners universally adopt a baseline-first strategy, building diagnostic confidence through initial evaluation before deeper probing. Second, they triangulate across samples, metrics, and interpretability outputs rather than relying on single signals, using comparison as a central sense-making operation for hypothesis testing. Third, diagnostic depth is systematically gated by three factors: interpretation friction (insufficient guidance on what methods reveal and how to act on their outputs), missing workflow glue (the absence of affordances for iterative refinement), and execution constraints (opaque platform limits that disrupt sustained diagnostic progress).
These findings reframe diagnostic tooling as infrastructure for iterative, hypothesis-driven reasoning rather than a collection of isolated analytical methods. Effective diagnostic support must scaffold the full investigative cycle: from expectation formation and baseline calibration through evidence triangulation, hypothesis testing via comparison, and state externalization. This scaffolding must also account for the gating factors that limit the depth of diagnostic progress. This reframing positions diagnosis as a knowledge and workflow challenge, with implications for tooling design, framework development, and empirical research into practitioner diagnostic workflows.