Z. Seyedghorban
Please Note
<p>This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.</p>
4 records found
Can Timing Localize Agent Failures?
Incorporating a temporal dimension into spectrum-based fault localization for LLM multi-agent systems
Large language models are fallible and indeterministic, which makes fault localization harder than in traditional deterministic software. It gets harder still in LLM-based Multi-Agent Systems (LLM-MAS), where a fault can be introduced by any agent at any point, rarely reproduces, and tends to propagate throughout the rest of an execution. Spectrum-based fault localization (SBFL) has worked well on classical software, and prior work has applied it to LLM-MAS with promising but substandard accuracy. This research proposes a new spectrum, one based on temporality. Each spectrum element is defined by an agent, an action, and the temporal window in which the action occurs. These windows can take many shapes; we compare four: relative static, absolute static, sliding, and dynamic windowing. The spectra are built from a new dataset of HyperAgent traces collected over three SWE-Bench Verified tasks, with the ground-truth faults and message labels assigned by an LLM-as-a-judge, and each windowing is evaluated by top-k accuracy. Relative static windowing with a high partition count performed best. The other strategies showed little consistent improvement over the baseline. Temporal ordering can therefore be correlated with faults, but the agent-action-window spectrum does not yet capture them reliably. A richer, denser spectrum will be needed to reach a workable level of accuracy.
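As a rough illustration of the agent-action-window idea in this abstract, the sketch below builds spectrum elements under relative static windowing. It assumes that a trace is an ordered list of (agent, action) events and that "relative static" means splitting each trace into a fixed number of equal-sized partitions by relative position; the function name, the agent and action labels, and that interpretation are all hypothetical and are not taken from the thesis.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]            # (agent, action); labels are hypothetical
Element = Tuple[str, str, int]     # (agent, action, window index)


def relative_static_windows(trace: List[Event], partitions: int) -> Set[Element]:
    """Map each event to one of `partitions` equal-sized windows by its
    relative position in the trace, and return the set of spectrum elements."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    n = len(trace)
    elements: Set[Element] = set()
    for i, (agent, action) in enumerate(trace):
        # Integer arithmetic keeps the window index in the range 0..partitions-1.
        window = min(i * partitions // n, partitions - 1)
        elements.add((agent, action, window))
    return elements


# Hypothetical four-step trace split into two windows.
trace = [
    ("planner", "plan"),
    ("navigator", "search"),
    ("editor", "edit"),
    ("executor", "run"),
]
elements = relative_static_windows(trace, partitions=2)
```

Because the window index is relative to trace length, traces of different lengths map onto the same set of window indices, which is one plausible reason a relative scheme could compare runs more easily than an absolute one.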
Spectrum-based Fault Localization for LLM-based Multi-Agent Systems
Identifying Faulty Agent Roles through Spectrum Analysis of Execution Traces
Bachelor thesis
(2026)
Dan Nguyen Le Kha Dan, B. Özkan, A. Panichella, Z. Seyedghorban, M.T.J. Spaan
Large Language Model-based Multi-Agent Systems (LLM-MAS) are promising frameworks for automating complex, real-world tasks. However, when these systems fail, failure attribution is challenging due to the stochastic behavior of Large Language Models (LLMs) and the distributed decision-making process of multi-agent collaboration. This paper investigates whether Spectrum-based Fault Localization (SBFL), a well-established technique in software testing and debugging, can be applied to identify faulty agent roles in LLM-MAS. We evaluate SBFL on HyperAgent across five SWE-bench Verified tasks, defining the spectra based on role message frequency and semantic output overlap. Agent roles are ranked by their computed suspiciousness scores, and fault localization performance is measured using Top-1 and Top-3 accuracy against ground-truth labels established by an LLM-as-a-judge. Our results show that semantic output overlap achieves the highest Top-1 accuracy of 60%, consistently outperforming raw message frequency spectra. However, none of the evaluated spectrum representations produces reliable Top-3 rankings, no SBFL formula consistently outperforms the others, and adding more execution runs does not consistently improve fault localization performance. These findings suggest that SBFL can support role-level fault localization in LLM-MAS, but its effectiveness depends strongly on spectrum design and remains limited for reliably identifying multiple faulty roles.
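To make the ranking step concrete, here is a minimal sketch of role-level SBFL using the standard Ochiai formula, ef / sqrt((ef + nf) * (ef + ep)), followed by a Top-k check. It simplifies the spectrum to binary role participation per run, whereas the abstract describes spectra based on message frequency and semantic output overlap; the abstract also does not say which SBFL formulas were evaluated, so Ochiai is an assumed example. The run data and role names are invented for illustration.

```python
import math
from typing import List, Set, Tuple

Run = Tuple[Set[str], bool]  # (roles active in the run, run failed?)


def ochiai(ef: int, ep: int, nf: int) -> float:
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep))."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def rank_roles(runs: List[Run]) -> List[Tuple[str, float]]:
    """Rank roles by suspiciousness, highest first (ties broken by name)."""
    total_failed = sum(1 for _, failed in runs if failed)
    roles = set().union(*(active for active, _ in runs)) if runs else set()
    scored = []
    for role in roles:
        ef = sum(1 for active, failed in runs if failed and role in active)
        ep = sum(1 for active, failed in runs if not failed and role in active)
        nf = total_failed - ef  # failing runs in which the role was absent
        scored.append((role, ochiai(ef, ep, nf)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def top_k_hit(ranking: List[Tuple[str, float]], faulty: Set[str], k: int) -> bool:
    """True if any ground-truth faulty role appears in the first k ranks."""
    return any(role in faulty for role, _ in ranking[:k])


# Hypothetical runs: two failing, two passing.
runs = [
    ({"planner", "editor"}, True),
    ({"planner", "navigator"}, False),
    ({"editor", "executor"}, True),
    ({"planner", "executor"}, False),
]
ranking = rank_roles(runs)
hit_at_1 = top_k_hit(ranking, faulty={"editor"}, k=1)
```

In this toy data the "editor" role appears in every failing run and in no passing run, so it receives the maximum Ochiai score of 1.0 and is ranked first.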
Interaction Pattern-Based Fault Localization in Multi-Agent Systems
Correlating Agent Execution Sequences with System Failures
Debugging Large Language Model-based Multi-Agent Systems (LLM-MAS) is challenging because failures emerge from semantic, nondeterministic conversational breakdowns rather than syntactic errors, turning verbose execution logs into a major debugging bottleneck. Traditional Spectrum-Based Fault Localization (SBFL) cannot isolate these flaws since it tracks code-level execution rather than agent actions and interactions. This research uses interaction pattern-based SBFL, enhanced by Markov Chain Surprise and statistical validation, to narrow the developer search space by correlating short n-grams with execution failures. Evaluated on the MAST dataset across MetaGPT, ChatDev, HyperAgent, and AG2, the pipeline abstracts raw logs into uniform sequences using a multi-framework tokenizer. Across the four evaluated frameworks, at n=4, SBFL and Markov surprise selected the same top-three candidate patterns, although their internal rank order sometimes differed. System effectiveness is evaluated through single- and cross-task evaluations, qualitative mapping to MAST failure modes, and semantic verification via an LLM-as-a-judge baseline. Additionally, this validation shows that top-ranked windows capture initiating failure interactions rather than downstream effects: on MetaGPT traces, rank-1 windows receive a "caused" verdict in 68.4% of triggered failing runs at n=4 (and 70.4% at n=3), vs. ~15.3% for random windows, as well as 63.4% vs. 20.9% on AG2 (n=4).
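The abstract does not define "Markov Chain Surprise", so the following is only one plausible reading: fit a first-order Markov chain on token sequences from reference runs, then score an n-gram window by the summed negative log-probability of its transitions, with add-one smoothing. The token names, the choice of passing runs as the reference set, and the smoothing scheme are all assumptions made for this sketch.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous windows of length n in a token sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def fit_transitions(sequences: List[Sequence[str]]) -> Dict[str, Counter]:
    """Count first-order transitions (previous token -> next token)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def surprise(window: Sequence[str], counts: Dict[str, Counter], vocab_size: int) -> float:
    """Summed -log2 P(next | prev) over the window, with add-one smoothing."""
    total = 0.0
    for prev, nxt in zip(window, window[1:]):
        row = counts.get(prev, Counter())
        prob = (row[nxt] + 1) / (sum(row.values()) + vocab_size)
        total += -math.log2(prob)
    return total


# Hypothetical abstracted traces; the vocabulary has four tokens:
# plan, code, test, review.
passing = [["plan", "code", "test"], ["plan", "code", "test"]]
model = fit_transitions(passing)
usual = surprise(("plan", "code", "test"), model, vocab_size=4)
unusual = surprise(("plan", "test", "code"), model, vocab_size=4)
windows = ngrams(["plan", "test", "code", "review"], 3)
```

Windows made of transitions that were rare or unseen in the reference runs score higher, so ranking windows by this value gives a second ordering that can be compared with an SBFL ranking over the same n-grams.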
We investigate whether specification-based fault localization (spec-based FL) can identify failure modes and families in failing LLM-based multi-agent systems (LLM-MAS), evaluated on the MAST Multi-Agent Debate (MAD) dataset. We implement a six-stage pipeline that extracts global and dynamic behavioral constraints from execution traces, evaluates them step-by-step, and uses the resulting violation log to drive an LLM judge toward a structured failure diagnosis.
On the 18-trace human-annotated MAD-Human dataset, the pipeline achieves 33.3% strict mode and 50.0% strict family accuracy, compared to 5.6% and 22.2% for a no-specification baseline; comparable gains are observed on a 14-trace HyperAgent SWE-Bench-Lite subset. Analysis of constraint violation logs suggests that the taxonomy targets carried by constraints, not their syntactic type, may be a primary driver of diagnostic accuracy, and that three constraints per step achieve accuracy equivalent to five at substantially lower cost.