L. Corti | TU Delft Repository

Diagnosing Failure Patterns in Large Language Models

A Symptom–Sign Framework and Integrated Toolkit for Practitioners

Master thesis (2026) - J.S. Beekman, J. Yang, L. Corti, K.W. Song

Large language models (LLMs) are increasingly deployed in consequential settings, yet their failures remain challenging to understand. Unlike traditional software bugs, such undesirable behavior emerges from distributed, context-dependent interactions that resist straightforward debugging. While XAI methods can surface signals about individual predictions, they do not directly support the hypothesis-driven investigative process that characterizes diagnosis in practice: forming expectations, gathering evidence, and identifying recurring failure patterns.

This thesis addresses this gap by introducing (1) a diagnostic framework that structures diagnosis through symptoms (observed undesirable outputs), signs (evidence from interpretability methods), and failure patterns (recurring, explainable combinations of symptoms and signs), complemented by a Should-Know/Really-Know lens that distinguishes task expectation from actual model knowledge; and (2) a prototype diagnostic toolkit that operationalizes this framework through integrated evaluation, run comparison, and state externalization.

An evaluation study with eight practitioners using codebook thematic analysis reveals three core findings. First, practitioners universally adopt a baseline-first strategy, building diagnostic confidence through initial evaluation before deeper probing. Second, they triangulate across samples, metrics, and interpretability outputs rather than relying on single signals, using comparison as a central sense-making operation for hypothesis testing. Third, diagnostic depth is systematically gated by three factors: interpretation friction (insufficient guidance on what methods reveal and how to act on their outputs), missing workflow glue (the absence of affordances for iterative refinement), and execution constraints (opaque platform limits that disrupt sustained diagnostic progress).

These findings reframe diagnostic tooling as infrastructure for iterative, hypothesis-driven reasoning, extending beyond the provision of isolated analytical methods. Effective diagnostic support must scaffold the full investigative cycle: from expectation formation and baseline calibration through evidence triangulation, hypothesis testing via comparison, and state externalization. This scaffolding must also account for the gating factors that shape the depth of diagnostic progress. This positions diagnosis as a knowledge and workflow challenge, with implications for tooling design, framework development, and empirical research into practitioner diagnostic workflows.

Data Model for Computer Vision Explainability, Fairness, and Robustness

Master thesis (2023) - Simran Karnani, J. Yang, A.M.A. Balayn, L. Corti, A. Anand, Q. Wang

In recent years, there has been a growing interest among researchers in the explainability, fairness, and robustness of Computer Vision models. While studies have explored the usability of these models for end users, limited research has delved into the challenges and requirements faced by researchers investigating these properties. This study addresses this gap through a mixed-method approach, involving 20 semi-structured interviews with researchers and a comprehensive literature analysis.
Through this investigation, we have identified a practical need for a data model that encompasses the essential information researchers require to enhance explainability, fairness, and robustness in Computer Vision applications. We developed a data model that holds the potential to improve transparency and reproducibility within this field, speed up the research process, and facilitate comprehensive evaluations, whether quantitative or qualitative, of proposed methodologies. To refine and demonstrate the practicality of the data model, we have populated it with four existing datasets. Additionally, we have conducted two user studies to validate the model's usability. We found that participants were enthusiastic about using the data model. Some potential uses described by the participants were comparing models and datasets, accessing (niche) datasets and models, creating and exploring datasets, and having access to ground truth explanations. However, participants also had concerns about the data model, mainly that its usability is restricted to people with database knowledge and that the richness of the data in the database is limited. Nonetheless, we hope that this research constitutes the first step for data modelling for researchers in the field of Trustworthy AI. ...

Clearing the Air: An Exploration of Pulmonologists' Needs and Intents in XAI Solutions for Respiratory Medicine

Master thesis (2023) - R.F.A. Oltmans, C. Lofi, J. Yang, L. Corti, Jiwon Jung

Despite the low adoption rates of artificial intelligence (AI) in respiratory medicine, its potential to improve patient outcomes is substantial. To facilitate the integration of AI systems into the clinical setting, it is essential to prioritise the development of explainable AI (XAI) solutions that improve the understanding of the AI predictions. These XAI solutions empower clinicians to collaborate effectively with AI systems, thereby enhancing the overall outcomes for patients in respiratory medicine. Unfortunately, the lack of user-centric studies in this domain has made it challenging to identify the specific aspects of explainability that are most effective in improving the adoption of AI in the real-world environment. To address this gap, we conducted a mixed-methods study of clinicians in respiratory medicine to identify the most relevant and crucial aspects of XAI solutions. Our study focused on understanding how XAI can be effectively translated into clinical practice by leveraging the expertise of doctors in the field. Because pulmonologists have limited knowledge of XAI concepts, we take a different approach from regular user-centric XAI research and use no direct examples of state-of-the-art XAI solutions. Rather, the expertise of doctors is used to let them identify their needs and intents implicitly. Our findings reveal that the successful adoption of XAI solutions in respiratory medicine requires tailored solutions that address communication barriers, promote patient-centric care, and overcome AI adoption challenges. The study highlights the significance of task-specific visualisations, comprehensive explanations, preferred granularity, and the ability to mimic human judgement in successful XAI solutions. Trust and collaboration between clinicians and AI systems are essential for effective adoption, wherein AI is perceived as a colleague rather than a replacement.
This ensures that clinicians can easily understand and work with the model predictions, ultimately leading to improved patient outcomes. By aligning XAI design with the needs and intents of pulmonologists, we established the importance of co-designing solutions with domain experts and embedding XAI within clinical workflows as key strategies. Our research underscores the imperative of transparency, extended validation, and continuous alignment of AI technologies with medical values. By following these principles, XAI solutions can be developed to enhance the diagnosis and treatment of respiratory illnesses, ultimately improving patient outcomes in respiratory medicine.

A System for Model Diagnosis centered around Human Computation

Master thesis (2023) - Z. Ziad Ahmad Saad Soliman Nawar, J. Yang, A.M.A. Balayn, L. Corti, A. Anand, E. Isufi

Machine learning (ML) systems for computer vision applications are widely deployed in decision-making contexts, including high-stakes domains such as autonomous driving and medical diagnosis. While they greatly accelerate decision-making, these systems have been found to suffer from a severe reliability issue: they can easily fail on serving data that differ slightly from the data seen during training. This issue has led to undesired outcomes that raise safety, ethical, and societal concerns across various applications, such as the numerous cases of semi-automatic cars causing accidents on the road.
In this thesis, we therefore develop a system to support ML practitioners in debugging their computer vision models, even before deploying them and having access to serving data.

We take inspiration from prior and ongoing work to formulate the current diagnosis problem, identify its challenges, and envision a human-computation-based solution. We then thoroughly analyse the requirements for developing a system that instantiates the solution, design such a system, and implement it as a well-functioning, highly modular, and easily customizable system.
The solution is based on the definition of human computation operations that, together, make it possible to a) identify the mechanisms a human would expect the model to learn in an ideal world, b) identify the mechanisms the model has actually learned (via annotations of saliency maps), and c) compare these two sets of mechanisms to draw conclusions about whether the model behaves as expected. The solution is specifically designed to account for uncertainty in the work of the human workers, and to handle ambiguous granularities in the concepts the model might have learned.
To the best of our knowledge, our work is the first system that allows an ML practitioner to first identify their own goals for debugging a model (among a large diversity of goals), accounting for their limited monetary budget, then to configure a debugging session according to these goals, and finally to run the system fully automatically with this configuration to obtain a model debugging report.

Finally, we conduct a thorough investigation of our system. First, we set out to assess the correctness and informativeness of the outputs by running the system with various configurations on different models trained on various datasets, for which the biases are more or less controlled. This first evaluation shows in particular that the system's outputs and implementation are correct. With these outputs, we are able to identify the biases that have been injected into the models, as well as to learn about previously unknown behaviors of highly common models that are used by many practitioners.
Second, we evaluate the cost-effectiveness of running the system. For that, we run tests in two settings: when the human workers might make mistakes (e.g., due to a lack of expertise, the complexity of the task, or inattention), and when human workers are fully accurate. We vary the configurations of the system (e.g., the order in which the human operations are conducted, the number of workers allocated at the start of the debugging session) within the two settings, and we observe how the number of human operations needed to reach correct system outputs evolves. We find that the system's output is potentially relevant, informative, and complete. The system output provides an in-depth analysis of the model's behaviour and unravels what the model comprehends, where it falls short, and what it should ideally have grasped.

All in all, in this thesis, we build the system and thoroughly evaluate it. While we identify a number of conceptual and practical limitations of this system (e.g., the difficulty of annotating concepts, potentially high cost), our work constitutes a first step towards developing complete solutions to help practitioners debug their systems. We encourage readers to build on our work in order to further optimize our system for cost. Note that we make all our code publicly available for anyone to re-use our system or reproduce our experiments.

Are BERT-based fact-checking models robust against adversarial attack?

Bachelor thesis (2023) - E.E. Afriat, Avishek Anand, Lijun Lyu, L. Corti

We seek to examine the vulnerability of BERT-based fact-checking models. We implement a gradient-based adversarial attack strategy based on HotFlip, which swaps individual tokens in the input. We apply this attack to a pre-trained ExPred model for fact-checking. We find that gradient-based adversarial attacks are ineffective against ExPred. However, uncertainties about the similarity of the examples generated by our adversarial attack implementation cast doubt on the results. ...
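
As a rough illustration of the token-swap idea behind HotFlip-style attacks, the sketch below picks the single substitution with the largest first-order estimate of the loss increase, computed as the dot product between the loss gradient at a position and the difference between the candidate and current token embeddings. It is a minimal sketch under stated assumptions, not the implementation used in the thesis: the function name `best_flip`, the toy embedding table, and the gradient values are all hypothetical, and in practice the gradients would come from backpropagation through the model under attack.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def best_flip(embeddings, token_ids, grads):
    """Return (position, new_token_id, estimated_loss_increase).

    embeddings: list of embedding vectors, indexed by token id (hypothetical toy table).
    token_ids:  token ids of the current input.
    grads:      gradient of the loss w.r.t. the input embedding at each position.
    The estimate for swapping token a with token b at position i is
    (E[b] - E[a]) . grads[i], a first-order approximation of the loss change.
    """
    best = None
    for pos, (tok, grad) in enumerate(zip(token_ids, grads)):
        base = dot(embeddings[tok], grad)
        for cand, emb in enumerate(embeddings):
            if cand == tok:
                continue  # swapping a token with itself changes nothing
            gain = dot(emb, grad) - base
            if best is None or gain > best[2]:
                best = (pos, cand, gain)
    return best


# Toy example: three tokens with one-hot embeddings, a two-token input.
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_input = [0, 1]
toy_grads = [[0.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
flip = best_flip(toy_embeddings, toy_input, toy_grads)
```

In this toy case the selected swap replaces the token at position 0 with token 2. A real attack would re-run the model after each swap to check whether the prediction actually changes, since the first-order estimate is only an approximation.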

Finding Shortcuts to a black-box model using Frequent Sequence Mining

Explaining Deep Learning models for Fact-Checking

Bachelor thesis (2023) - J.P. Smit, A. Anand, L. Lyu, L. Corti, M. Loog

Deep-learning (DL) models could greatly advance the automation of fact-checking, yet have not been widely adopted by the public because of their hard-to-explain nature. Although various techniques have been proposed to use local explanations for the behaviour of DL models, little attention has been paid to global explanations.
In response, we investigate whether a frequent sequence mining (FSM) tool finds sequence patterns that act as shortcuts for a state-of-the-art model in the context of fact-checking. By studying the connections between a model’s input and output, association rules (ARs) can be used as a global explanation for the interpretation of the model. The shortcuts were evaluated using a heuristic-based minimum support value, the strength of each rule was determined using confidence, and the support value indicates the global coverage of rules. Shortcuts help to form an interpretation for creating counterfactual prompts, which can be used as a risk assessment tool for DL models. Other applications for rule-based global explanations are left for future work ...
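
To make the two rule metrics mentioned above concrete, the following sketch computes support and confidence for a rule of the form "input contains this token sequence, so the model predicts this label", using the standard association-rule definitions. It is an illustrative sketch only: the helper names, the toy data, and the choice to allow gaps between pattern tokens are assumptions, and the thesis may define or compute these quantities differently.

```python
def contains_subsequence(tokens, pattern):
    """True if `pattern` occurs in `tokens` in order (gaps allowed, an assumption)."""
    remaining = iter(tokens)
    return all(tok in remaining for tok in pattern)


def rule_stats(dataset, pattern, label):
    """Return (support, confidence) for the rule `pattern -> label`.

    dataset: list of (tokens, predicted_label) pairs.
    support:    fraction of all inputs that contain the pattern AND get the label.
    confidence: fraction of inputs containing the pattern that get the label.
    """
    total = len(dataset)
    matched = [pred for tokens, pred in dataset
               if contains_subsequence(tokens, pattern)]
    hits = sum(1 for pred in matched if pred == label)
    support = hits / total if total else 0.0
    confidence = hits / len(matched) if matched else 0.0
    return support, confidence


# Hypothetical model predictions on four toy claims.
toy_predictions = [
    (["the", "study", "shows", "that"], "SUPPORTS"),
    (["a", "study", "clearly", "shows"], "SUPPORTS"),
    (["no", "study", "shows", "this"], "REFUTES"),
    (["experts", "say", "otherwise"], "REFUTES"),
]
support, confidence = rule_stats(toy_predictions, ("study", "shows"), "SUPPORTS")
```

Here the pattern appears in three of the four inputs and co-occurs with the label in two, giving a support of 0.5 and a confidence of 2/3. A minimum support threshold would then discard rules that cover too few inputs to count as a global pattern.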

Evaluating Feature Attribution Methods: an Usecase on a Neural Fact-checking Model

Bachelor thesis (2023) - A. Simons, A. Anand, L. Lyu, L. Corti, M. Loog

In today's society, claims are everywhere, in the online and offline world. Fact-checking models can check these claims and predict if a claim is true or false, but how can these models be checked? Post-hoc XAI feature attribution methods can be used for this. These methods give scores indicating the influence of the individual tokens on the model's decision-making. In our research, we evaluate three popular feature attribution methods in the context of fact-checking: LIME, Kernel SHAP, and Integrated Gradients. We used the NLP architecture ExPred as a fact-checking model in our research. The feature attribution methods were evaluated using a human-grounded and pseudo ground truth evaluation. The results from these evaluations indicate that Integrated Gradients enables humans to form an opinion better and performs better in our pseudo ground truth evaluation. A potential explanation is that the iterations should be increased for LIME and Kernel SHAP. Our findings suggest that Integrated Gradients performs better in our study. Still, more research for other tasks and models would be beneficial to ensure that these results apply to other cases. ...

How do different explanation presentation strategies of feature and data attribution techniques affect non-expert understanding?

Explaining Deep Learning models for Fact-Checking

Bachelor thesis (2023) - S. Singh, A. Anand, L. Lyu, L. Corti, M. Loog

The goal of this paper is to examine how different presentation strategies of Explainable Artificial Intelligence (XAI) explanation methods for textual data affect non-expert understanding in the context of fact-checking. The importance of understanding the decision of an Artificial Intelligence (AI) in human-AI interaction and the need for effective explanation methods to improve trust in AI models are highlighted. The study focuses on three explanation methods: interpretable-by-design model ExPred, feature attribution technique LIME, and instance attribution method k-NN. Two presentation strategies were compared for each method, and participants were presented with a set of claims and asked to indicate their understanding and level of agreement with the AI’s classification. The main hypothesis is that participants will appreciate all available context and details, as long as it is presented in a structured way, and will find visual representations of data easier to understand than textual ones. Results from the study indicate that participants prefer explanations that are simple and structured, and that visual presentations are not as effective, especially when it is the first time a user interacts with this type of data. Additionally, it was found that better formatting leads to a better-calibrated understanding of the explanation. The results of this study will provide valuable insight into the best way to present XAI explanations to non-experts to enhance their understanding and reduce the deployment risk associated with Natural Language Processing (NLP) models for automated fact-checking. The study’s code, data, and Figma templates are publicly available for reproducibility. ...