Evaluating Self-Correcting LLM Agents for Robust Test Assertion Generation

Bachelor Thesis (2026)

Author(s)

H. Galitianu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Mitchell Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Large Language Models, Multi-agent systems, LLM, Automated Software Testing, Mutation testing, Test Oracle Generation, Self-Correction, Assertion generation

To reference this document use

https://resolver.tudelft.nl/uuid:7b8bd5c3-fa83-4ae2-9bc0-37ca94980fee

More Info

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Robust test assertions are critical for verifying deep semantic behavior, but their automated generation remains a primary bottleneck in software testing. Automated test case generation approaches often rely on implicit oracles or regression checks that miss semantic failures. Large language models (LLMs) can synthesize meaningful assertions, but single-pass prompting frequently produces uncompilable or failing code. We propose a multi-agent workflow for Java test assertion generation consisting of code comprehension, test objective planning, and assertion generation. The workflow extracts mutation-relevant variable manifests, structures high-level testing plans, compiles and executes the generated test candidates, and iteratively refines assertions using mutation-testing feedback from PITest to optimize mutation quality before final selection.
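The compile, execute, and refine cycle described above can be pictured as a small feedback loop. The following is a minimal sketch under stated assumptions, not the thesis implementation: the names `RefinementLoop`, `Feedback`, `Scored`, `evaluate`, and `revise` are hypothetical, the evaluation step (which would wrap compilation, execution, and a PITest run) is left to the caller, and the fixed round budget and "keep the best valid candidate" selection rule are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Illustrative refinement loop; every name here is hypothetical, not taken from the thesis. */
final class RefinementLoop {

    /** Result of compiling, running, and mutation-testing one candidate test. */
    record Feedback(boolean valid, double testStrength) {}

    /** A candidate test paired with the feedback it received. */
    record Scored(String testSource, Feedback feedback) {}

    /**
     * Evaluates up to maxRounds candidates and returns the valid one with the
     * highest test strength, or an empty Optional if none was valid.
     */
    static Optional<Scored> refine(
            String initialTest,
            Function<String, Feedback> evaluate,             // assumed: compile + run + mutation test
            BiFunction<String, Feedback, String> revise,     // assumed: LLM call that rewrites assertions
            int maxRounds) {
        Scored best = null;
        String candidate = initialTest;
        for (int round = 0; round < maxRounds; round++) {
            Feedback feedback = evaluate.apply(candidate);
            boolean improves = best == null
                    || feedback.testStrength() > best.feedback().testStrength();
            // Only compilable, executable, passing candidates are eligible for selection.
            if (feedback.valid() && improves) {
                best = new Scored(candidate, feedback);
            }
            // Skip the revision after the last round, since its result would never be evaluated.
            if (round + 1 < maxRounds) {
                candidate = revise.apply(candidate, feedback);
            }
        }
        return Optional.ofNullable(best);
    }
}
```

Passing the evaluation and revision steps as functions keeps the loop independent of any particular build tool or model client, so it can be exercised with stubs.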

We evaluate the approach on 112 focal tests from twilio-java and liqp. Compared with static prompting, agentic configurations substantially improve reliability, increasing the percentage of valid runs (compilable, executable, and passing tests) from 58.1% to 84.8%. Relative to the human baseline, the agentic configuration raises the average Test Strength (the ratio of killed mutants to covered mutants) from 45.6% to approximately 56%. Our evaluation shows that while execution feedback significantly improves reliability and observed Test Strength, combining all agentic components does not yield the best computational trade-off.
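To make the Test Strength definition above concrete, here is a minimal sketch of the ratio of killed mutants to covered mutants, next to the more common mutation score, which divides by all generated mutants instead. The class and method names are hypothetical, the counts in the usage comment are made-up illustrative values rather than results from the thesis, and returning 0.0 when the denominator is zero is an assumption.

```java
/** Illustrative metric helpers; names and the zero-denominator convention are assumptions. */
final class MutationMetrics {

    private MutationMetrics() {}

    /** Test Strength: killed mutants divided by mutants covered by the test. */
    static double testStrength(int killed, int covered) {
        return ratio(killed, covered);
    }

    /** Mutation score: killed mutants divided by all generated mutants. */
    static double mutationScore(int killed, int generated) {
        return ratio(killed, generated);
    }

    private static double ratio(int killed, int total) {
        if (killed < 0 || total < 0 || killed > total) {
            throw new IllegalArgumentException("expected 0 <= killed <= total");
        }
        // Assumption: with nothing to kill, report 0.0 instead of dividing by zero.
        return total == 0 ? 0.0 : (double) killed / total;
    }
}
```

For example, with 8 generated mutants of which 4 are covered and 3 are killed, `MutationMetrics.testStrength(3, 4)` gives 0.75 while `MutationMetrics.mutationScore(3, 8)` gives 0.375. Test Strength therefore isolates assertion quality from coverage, which suits an evaluation that holds the test body fixed and varies only the assertions.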
