Specification-Based Fault Localization in LLM based Multi-Agent Systems

Bachelor Thesis (2026)

Author(s)

M. Aksoy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

B. Özkan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Z. Seyedghorban – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Fault Localization, Specification, Specification extraction, Multi-Agent Systems (MAS)

To reference this document use

https://resolver.tudelft.nl/uuid:a36c8e5c-59d8-439f-9155-bef21bf807ab

More Info

Publication Year

2026

Language

English

Graduation Date

24-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Downloads counter

14

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We investigate whether specification-based fault localization (spec-based FL) can identify failure modes and failure families in failing LLM-based multi-agent systems (LLM-MAS), evaluated on the MAST Multi-Agent Debate (MAD) dataset. We implement a six-stage pipeline that extracts global and dynamic behavioral constraints from execution traces, evaluates them step by step, and uses the resulting violation log to drive an LLM judge toward a structured failure diagnosis.
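
The step-by-step evaluation described above can be illustrated with a minimal sketch. It is not the thesis implementation: the `Constraint` and `Violation` classes, the `evaluate_trace` function, the two toy constraints, and the target labels are all hypothetical, and the sketch does not model the distinction between global and dynamic constraints or the LLM judge. It only shows how checking each constraint at each trace step could produce a violation log.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """A behavioral constraint (hypothetical structure, not from the thesis)."""
    name: str
    target: str  # illustrative taxonomy target label, assumed for this sketch
    check: Callable[[List[str], int], bool]  # True if satisfied at step i


@dataclass
class Violation:
    step: int
    constraint: str
    target: str


def evaluate_trace(trace: List[str], constraints: List[Constraint]) -> List[Violation]:
    """Check every constraint at every step and collect the violations in order."""
    log: List[Violation] = []
    for i in range(len(trace)):
        for c in constraints:
            if not c.check(trace, i):
                log.append(Violation(step=i, constraint=c.name, target=c.target))
    return log


# Two toy constraints, invented purely for illustration.
non_empty = Constraint(
    "non_empty_message", "toy-target-A",
    lambda t, i: bool(t[i].strip()),
)
no_repeat = Constraint(
    "no_verbatim_repeat", "toy-target-B",
    lambda t, i: i == 0 or t[i] != t[i - 1],
)

# A made-up four-step debate trace.
trace = [
    "Agent A: the answer is 4",
    "Agent B: I disagree, it is 5",
    "Agent B: I disagree, it is 5",
    "",
]

log = evaluate_trace(trace, [non_empty, no_repeat])
for v in log:
    print(v.step, v.constraint, v.target)
```

In a pipeline of the kind the abstract describes, a log like this would be passed to the judge as evidence; how the thesis formats or filters that log is not stated here.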

On the 18-trace human-annotated MAD-Human dataset, the pipeline achieves 33.3% strict failure-mode accuracy and 50.0% strict failure-family accuracy, compared to 5.6% and 22.2% for a no-specification baseline; comparable gains are observed on a 14-trace HyperAgent SWE-Bench-Lite subset. Analysis of the constraint violation logs suggests that the taxonomy targets carried by constraints, rather than their syntactic type, may be a primary driver of diagnostic accuracy, and that using three constraints per step achieves accuracy equivalent to using five at substantially lower cost.
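
Because the MAD-Human dataset has only 18 traces, each reported percentage corresponds to a small whole number of traces. The sketch below back-computes those counts from the figures in the abstract. The counts are inferred from the rounded percentages, not stated in the source, and the helper name `implied_count` is an assumption of this example.

```python
N_TRACES = 18  # size of the MAD-Human dataset, as stated in the abstract

# Percentages reported in the abstract.
reported = {
    "pipeline, strict mode": 33.3,
    "pipeline, strict family": 50.0,
    "baseline, strict mode": 5.6,
    "baseline, strict family": 22.2,
}


def implied_count(pct: float, n: int) -> int:
    """Nearest whole number of traces consistent with a percentage of n."""
    return round(pct / 100 * n)


counts = {label: implied_count(pct, N_TRACES) for label, pct in reported.items()}

for label, k in counts.items():
    # Recompute the percentage from the inferred count as a consistency check.
    print(f"{label}: {k}/{N_TRACES} = {100 * k / N_TRACES:.1f}%")
```

Under this reading, a difference of a single trace shifts accuracy by about 5.6 percentage points, which is worth keeping in mind when comparing the pipeline with the baseline.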

Files

RP_SPEC_FAULT_LOCALIZATION.pdf

(pdf | 0.255 Mb)

License info not available