Can Timing Localize Agent Failures?
Incorporating a temporal dimension into spectrum-based fault localization for LLM multi-agent systems
H.M. Schouwenaars (TU Delft - Electrical Engineering, Mathematics and Computer Science)
B. Özkan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Seyedghorban – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.T.J. Spaan – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Large language models are fallible and nondeterministic, which makes fault localization harder than in traditional deterministic software. It gets harder still in LLM-based Multi-Agent Systems (LLM-MAS), where a fault can be introduced by any agent at any point, rarely reproduces, and tends to propagate throughout the rest of an execution. Spectrum-based fault localization (SBFL) has worked well on classical software, and prior work has applied it to LLM-MAS with promising but substandard accuracy. This research proposes a new spectrum, one based on temporality. Each spectrum element is defined by an agent, an action, and the temporal window in which the action occurs. These windows can take many shapes; we compare four: relative static, absolute static, sliding, and dynamic windowing. The spectra are built from a new dataset of HyperAgent traces collected over three SWE-Bench Verified tasks, with the ground-truth faults and message labels assigned by an LLM-as-a-judge, and each windowing strategy is evaluated by top-k accuracy. Relative static windowing with a high partition count performed best. The other strategies showed little consistent improvement over the baseline. Temporal ordering can therefore be correlated with faults, but the agent-action-window spectrum does not yet capture them reliably. A richer, denser spectrum will be needed to reach a workable level of accuracy.
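To make the agent-action-window spectrum concrete, the following is a minimal sketch of how such elements might be built and ranked. It is an illustration under stated assumptions, not the thesis implementation. The abstract does not say which suspiciousness formula is used, so Ochiai is assumed here as a common SBFL choice. The window definitions are also assumptions: relative static windowing is taken to split each trace into a fixed number of equal partitions by relative position, and absolute static windowing is taken to use a fixed number of steps per window. The agent and action names in the usage note are hypothetical.

```python
import math
from collections import defaultdict


def relative_window(index, trace_len, partitions):
    """Assumed relative static windowing: map a step to one of
    `partitions` equal slices of the trace, by relative position."""
    return min(index * partitions // trace_len, partitions - 1)


def absolute_window(index, window_size):
    """Assumed absolute static windowing: fixed number of steps per window."""
    return index // window_size


def build_spectrum(trace, partitions):
    """Return the set of (agent, action, window) elements covered by a trace.
    `trace` is an ordered list of (agent, action) pairs."""
    n = len(trace)
    return {
        (agent, action, relative_window(i, n, partitions))
        for i, (agent, action) in enumerate(trace)
    }


def ochiai_ranking(runs, partitions):
    """Rank spectrum elements by Ochiai suspiciousness (an assumed formula).
    `runs` is a list of (trace, failed) pairs."""
    total_failed = sum(1 for _, failed in runs if failed)
    failed_hits = defaultdict(int)
    passed_hits = defaultdict(int)
    for trace, failed in runs:
        for element in build_spectrum(trace, partitions):
            if failed:
                failed_hits[element] += 1
            else:
                passed_hits[element] += 1

    scores = {}
    for element in set(failed_hits) | set(passed_hits):
        ef = failed_hits[element]  # failed runs covering the element
        ep = passed_hits[element]  # passed runs covering the element
        denom = math.sqrt(total_failed * (ef + ep))
        scores[element] = ef / denom if denom else 0.0
    # Highest score first; ties broken by element for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def top_k_hit(ranking, faulty_element, k):
    """True if the ground-truth faulty element is among the top k."""
    return faulty_element in [element for element, _ in ranking[:k]]
```

As a usage example, take one passing run with steps `("planner", "plan")`, `("editor", "edit")`, `("executor", "run")`, and one failing run that repeats the `("editor", "edit")` step before `("executor", "run")`. With two partitions, the repeated edit lands in the second window of the failing run only, so `("editor", "edit", 1)` ranks first. The same agent and action in the first window is covered by both runs and scores lower, which is the kind of distinction a purely agent-action spectrum cannot make.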