Evaluating Self-Correcting LLM Agents for Robust Test Assertion Generation

Bachelor Thesis (2026)
Author(s)

H. Galitianu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Mitchell Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
23-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Robust test assertions are critical for verifying deep semantic behavior, but their automated generation remains a major bottleneck in software testing. Automated test case generation approaches often rely on implicit oracles or regression checks that miss semantic failures. Large language models (LLMs) can synthesize meaningful assertions, but single-pass prompting frequently produces code that does not compile or tests that fail. We propose a multi-agent workflow for Java test assertion generation consisting of code comprehension, test objective planning, and assertion generation. The workflow extracts mutation-relevant variable manifests, structures high-level testing plans, compiles and executes the generated test candidates, and iteratively refines assertions using mutation-testing feedback from PITest to improve mutation-testing results before final selection.
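The compile, execute, and refine cycle described in the abstract can be pictured with a minimal Java sketch. It is not the thesis implementation: the class `RefinementLoop`, the `Feedback` type, and the `evaluate` and `revise` callbacks are hypothetical names introduced here, standing in for the build-and-mutation-testing step and the LLM revision step. The stopping rule and the "keep the best valid candidate" selection are also assumptions.

```java
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

public final class RefinementLoop {
    private RefinementLoop() {}

    /** Hypothetical summary of compiling, running, and mutation-testing one candidate test. */
    public static final class Feedback {
        public final boolean valid;   // compiles, executes, and passes
        public final int killed;      // mutants killed by the candidate
        public final int covered;     // mutants covered by the candidate
        public final String message;  // diagnostics passed back to the reviser

        public Feedback(boolean valid, int killed, int covered, String message) {
            this.valid = valid;
            this.killed = killed;
            this.covered = covered;
            this.message = message;
        }

        /** Killed-to-covered ratio in [0, 1]; 0 when nothing is covered (assumption). */
        public double strength() {
            return covered == 0 ? 0.0 : (double) killed / covered;
        }
    }

    /**
     * Evaluates a candidate, keeps the strongest valid one seen so far, and asks
     * the reviser for a new candidate until the budget runs out or every covered
     * mutant is killed.
     */
    public static Optional<String> refine(
            String initial,
            Function<String, Feedback> evaluate,
            BiFunction<String, Feedback, String> revise,
            int maxIterations) {
        String candidate = initial;
        String best = null;
        double bestStrength = -1.0;

        for (int i = 0; i < maxIterations; i++) {
            Feedback feedback = evaluate.apply(candidate);

            // Only valid candidates are eligible for final selection.
            if (feedback.valid && feedback.strength() > bestStrength) {
                best = candidate;
                bestStrength = feedback.strength();
            }

            // Stop early once all covered mutants are killed.
            if (feedback.valid && feedback.covered > 0 && feedback.killed == feedback.covered) {
                break;
            }

            // Request a revision only if another iteration will use it.
            if (i + 1 < maxIterations) {
                candidate = revise.apply(candidate, feedback);
            }
        }
        return Optional.ofNullable(best);
    }
}
```

In a real pipeline, `evaluate` would wrap the build and the mutation-testing run, and `revise` would wrap the model call. Both are left abstract here so the control flow can be read on its own.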

We evaluate the approach on 112 focal tests from twilio-java and liqp. Compared with static prompting, agentic configurations substantially improve reliability, increasing the percentage of valid runs (compilable, executable, and passing tests) from 58.1% to 84.8%. Relative to the human baseline, the agentic configuration raises the average Test Strength (the ratio of killed mutants to covered mutants) from 45.6% to approximately 56%. Our evaluation shows that while execution feedback significantly improves reliability and observed Test Strength, combining all agentic components does not yield the best computational trade-off.
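As a small illustration of the Test Strength definition used above (killed mutants divided by covered mutants), the following Java sketch computes the metric as a percentage. The class name `TestStrength`, the method `percent`, and the example counts are hypothetical and not taken from the thesis. Returning 0 when no mutant is covered is an assumption made here to avoid dividing by zero.

```java
import java.util.Locale;

public final class TestStrength {
    private TestStrength() {}

    /** Test Strength in percent: killed mutants relative to covered mutants. */
    public static double percent(int killed, int covered) {
        if (killed < 0 || covered < 0 || killed > covered) {
            throw new IllegalArgumentException("expected 0 <= killed <= covered");
        }
        if (covered == 0) {
            return 0.0; // assumption: report 0 when no mutant is covered
        }
        return 100.0 * killed / covered;
    }

    public static void main(String[] args) {
        // Hypothetical counts for illustration only: 3 of 4 covered mutants killed.
        System.out.println(String.format(Locale.ROOT, "%.1f%%", percent(3, 4))); // prints 75.0%
    }
}
```

Because the denominator counts only covered mutants, this metric differs from a mutation score computed over all generated mutants. A test that covers few mutants but kills all of them still reaches 100%.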

Files

License info not available