Fault Localization in LLM-Based Multi-Agent Systems

Scope-Guided LLM Judging for Responsible-Agent and Failure-Step Attribution

Bachelor Thesis (2026)
Author(s)

Y.S. Pachedzhiev (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

B. Özkan – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Panichella – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Z. Seyedghorban – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
24-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

LLM-based multi-agent systems often produce long execution traces containing agent messages, tool outputs, intermediate decisions, and final responses. When a task fails, the failed outcome usually does not show which agent caused the failure or which earlier step introduced it. This paper treats fault localization as failure attribution: predicting the responsible agent and the decisive failure step in a failed multi-agent trace. It compares a direct whole-trace baseline with two-stage scope-guided judging on the Hand-Crafted subset of the Who&When benchmark. In the direct baseline, one LLM judge receives the full trace and predicts both labels. In the scope-guided methods, a first-stage selector chooses a small set of reference steps, and the same final judge predicts the labels from the full trace plus those selected steps. The experiments show that scope guidance is not generally beneficial. Generic LLM scope selection improves selected-scope Hit@5 over random selection, but does not improve final attribution over direct whole-trace judging. The source-candidate-pool selector gives the best responsible-agent, failure-step, and joint attribution accuracy, but the improvement is modest and requires more than four times the mean token cost of direct whole-trace judging. Overall, scope guidance helps only when the selected steps point the judge toward earlier source-level evidence. Direct whole-trace judging remains a strong lower-cost baseline.
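The abstract names four evaluation quantities: responsible-agent accuracy, failure-step accuracy, joint accuracy, and selected-scope Hit@5. The following is a minimal sketch of how such metrics could be computed, not the thesis's actual evaluation code. It assumes that Hit@k counts a case as a hit when the gold failure step is among the first k selected steps, and that joint accuracy requires both the agent and the step to match. The `Attribution` type and both function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Attribution:
    """Hypothetical label pair: responsible agent and decisive failure step."""
    agent: str
    step: int


def hit_at_k(selected_steps: Sequence[int], true_step: int, k: int = 5) -> bool:
    """Assumed definition: the gold step appears among the first k selected steps."""
    return true_step in list(selected_steps)[:k]


def attribution_accuracy(
    predictions: Sequence[Attribution], gold: Sequence[Attribution]
) -> Dict[str, float]:
    """Return agent, step, and joint accuracy over paired predictions and gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("at least one labelled trace is required")

    n = len(gold)
    pairs = list(zip(predictions, gold))
    agent_hits = sum(p.agent == g.agent for p, g in pairs)
    step_hits = sum(p.step == g.step for p, g in pairs)
    # Joint accuracy (assumed): both labels must be correct for the same trace.
    joint_hits = sum(p == g for p, g in pairs)

    return {
        "agent": agent_hits / n,
        "step": step_hits / n,
        "joint": joint_hits / n,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not results from the thesis.
    preds = [Attribution("planner", 3), Attribution("coder", 7)]
    gold = [Attribution("planner", 3), Attribution("coder", 5)]
    print(attribution_accuracy(preds, gold))
    print(hit_at_k([1, 3, 8, 9, 12], true_step=3))
```

Under these assumptions, a selector can score well on Hit@5 while the final judge still mislabels the trace, which is consistent with the abstract's observation that better scope selection does not by itself improve final attribution.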

Files

Rp_final.pdf
(pdf | 1.19 Mb)
License info not available