B. Duminicã

Correlating Agent Execution Sequences with System Failures

Debugging Large Language Model-based Multi-Agent Systems (LLM-MAS) is challenging because failures emerge from semantic, nondeterministic conversational breakdowns rather than syntactic errors, which turns verbose execution logs into a major debugging bottleneck. Traditional Spectrum-Based Fault Localization (SBFL) cannot isolate these flaws because it tracks code-level execution rather than agent actions and interactions. This research applies interaction pattern-based SBFL, enhanced by Markov Chain Surprise and statistical validation, to narrow the developer's search space by correlating short n-grams of agent interactions with execution failures. The pipeline is evaluated on the MAST dataset across MetaGPT, ChatDev, HyperAgent, and AG2, and uses a multi-framework tokenizer to abstract raw logs into uniform sequences. Across the four frameworks, at n=4, SBFL and Markov surprise selected the same top-three candidate patterns, although their internal rank order sometimes differed. System effectiveness is assessed through single- and cross-task evaluations, qualitative mapping to MAST failure modes, and semantic verification via an LLM-as-a-judge baseline. This validation also shows that top-ranked windows capture initiating failure interactions rather than downstream effects: on MetaGPT traces, rank-1 windows receive a "caused" verdict in 68.4% of triggered failing runs at n=4 (and 70.4% at n=3), versus ~15.3% for random windows; on AG2 at n=4, the figures are 63.4% versus 20.9%. ...
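To make the core idea concrete, the following is a minimal sketch of how n-gram windows over tokenized agent traces might be ranked. It is not the pipeline described in the abstract. It assumes the Ochiai formula as the SBFL suspiciousness metric (the abstract does not name one), defines Markov surprise as the mean negative log-probability of a window's transitions under a first-order chain fitted on passing runs with add-one smoothing (also an assumption), and uses hypothetical token names such as plan, code, test, and review in place of the real tokenizer output.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the set of distinct length-n windows in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ochiai_scores(traces, n):
    """Score every n-gram with the Ochiai suspiciousness formula.

    traces: list of (tokens, failed) pairs, one per run.
    Ochiai = ef / sqrt(total_failed * (ef + ep)), where ef and ep count the
    failing and passing runs that contain the n-gram.
    """
    total_failed = sum(1 for _, failed in traces if failed)
    in_failed, in_passed = Counter(), Counter()
    for tokens, failed in traces:
        for gram in ngrams(tokens, n):
            (in_failed if failed else in_passed)[gram] += 1
    scores = {}
    for gram in set(in_failed) | set(in_passed):
        ef, ep = in_failed[gram], in_passed[gram]
        denom = math.sqrt(total_failed * (ef + ep))
        scores[gram] = ef / denom if denom else 0.0
    return scores


def markov_surprise(passing_traces, window, alpha=1.0):
    """Mean negative log-probability of the window's transitions.

    The first-order transition counts come from passing runs only, so
    transitions that are rare in healthy runs yield a high surprise.
    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    transitions = defaultdict(Counter)
    vocab = set(window)
    for tokens in passing_traces:
        vocab.update(tokens)
        for a, b in zip(tokens, tokens[1:]):
            transitions[a][b] += 1
    steps = list(zip(window, window[1:]))
    total = 0.0
    for a, b in steps:
        row = transitions[a]
        p = (row[b] + alpha) / (sum(row.values()) + alpha * len(vocab))
        total -= math.log(p)
    return total / len(steps) if steps else 0.0


# Hypothetical toy traces: (token sequence, run failed?)
traces = [
    (["plan", "code", "test", "review"], False),
    (["plan", "code", "test", "review"], False),
    (["plan", "code", "code", "test"], True),
    (["plan", "code", "code", "review"], True),
]
passing = [tokens for tokens, failed in traces if not failed]

scores = ochiai_scores(traces, n=2)
top_gram = max(scores, key=scores.get)
print(top_gram, round(scores[top_gram], 3))
print(round(markov_surprise(passing, top_gram), 3))
```

In this toy data the repeated code step appears only in failing runs, so it receives the highest Ochiai score and is also more surprising under the passing-run chain than a routine transition such as plan followed by code. Agreement between the two rankings on a candidate window is the kind of signal the abstract describes at n=4.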