Evaluating Self-Correcting LLM Agents for Robust Test Assertion Generation
H. Galitianu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Mitchell Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Robust test assertions are critical for verifying deep semantic behavior, but their automated generation remains a primary bottleneck in software testing. Automated test case generation approaches often rely on implicit oracles or regression checks that miss semantic failures. Large language models (LLMs) can synthesize meaningful assertions, but single-pass prompting frequently produces uncompilable or failing code. We propose a multi-agent workflow for Java test assertion generation consisting of code comprehension, test objective planning, and assertion generation. The workflow extracts mutation-relevant variable manifests, structures high-level testing plans, compiles and executes the generated test candidates, and iteratively refines assertions using mutation-testing feedback from PITest to optimize mutation quality before final selection.
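The generate, execute, and refine cycle described above can be sketched as a small Java loop. This is only an illustration under stated assumptions, not the implementation evaluated in this work: `Generator`, `Evaluator`, and `Evaluation` are hypothetical stand-ins for the LLM-backed assertion generator and for the compile, execute, and PITest step, and the selection rule (keep the valid candidate with the highest Test Strength) is an assumption, since the abstract does not state the exact criterion.

```java
import java.util.Optional;

/** Hypothetical sketch of an iterative assertion-refinement loop. */
final class RefinementLoop {
    private RefinementLoop() {}

    /** Outcome of compiling, running, and mutation-testing one candidate (hypothetical type). */
    static final class Evaluation {
        final boolean valid;        // compiled, executed, and passed
        final double testStrength;  // killed / covered mutants, in [0, 1]
        final String feedback;      // e.g. compiler errors or surviving mutants

        Evaluation(boolean valid, double testStrength, String feedback) {
            this.valid = valid;
            this.testStrength = testStrength;
            this.feedback = feedback;
        }
    }

    /** Hypothetical stand-in for the LLM-backed assertion generator. */
    interface Generator {
        String generate(String previousFeedback);
    }

    /** Hypothetical stand-in for compilation, execution, and mutation analysis. */
    interface Evaluator {
        Evaluation evaluate(String candidate);
    }

    /**
     * Runs up to maxIterations rounds, feeding each round's feedback into the next.
     * Assumption: the best candidate is the valid one with the highest Test Strength.
     * Returns an empty Optional if no candidate was ever valid.
     */
    static Optional<String> refine(Generator generator, Evaluator evaluator, int maxIterations) {
        String best = null;
        double bestStrength = -1.0;
        String feedback = "";
        for (int i = 0; i < maxIterations; i++) {
            String candidate = generator.generate(feedback);
            Evaluation result = evaluator.evaluate(candidate);
            if (result.valid && result.testStrength > bestStrength) {
                best = candidate;
                bestStrength = result.testStrength;
            }
            feedback = result.feedback;
        }
        return Optional.ofNullable(best);
    }
}
```

Keeping invalid candidates out of the selection while still passing their feedback to the next round reflects the idea that a compilation or execution failure is useful input for refinement even though the candidate itself cannot be chosen.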
We evaluate the approach on 112 focal tests from twilio-java and liqp. Compared with static prompting, agentic configurations substantially improve reliability, increasing the percentage of valid runs (compilable, executable, and passing tests) from 58.1% to 84.8%. Relative to the human baseline, the agentic configuration raises the average Test Strength (the ratio of killed mutants to covered mutants) from 45.6% to approximately 56%. Our evaluation shows that while execution feedback significantly improves reliability and observed Test Strength, combining all agentic components does not yield the best computational trade-off.
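As a concrete reading of the Test Strength metric defined above (killed mutants divided by covered mutants), a minimal Java sketch follows. The class and method names are hypothetical, and returning 0.0 when no mutants are covered is an assumption made here to avoid division by zero; the abstract does not say how that case is handled.

```java
/** Minimal sketch of the Test Strength metric: killed mutants over covered mutants. */
final class TestStrength {
    private TestStrength() {}

    /**
     * Returns Test Strength as a percentage.
     * Assumption: yields 0.0 when no mutants are covered.
     */
    static double percent(int killedMutants, int coveredMutants) {
        if (killedMutants < 0 || coveredMutants < 0 || killedMutants > coveredMutants) {
            throw new IllegalArgumentException("expected 0 <= killed <= covered");
        }
        if (coveredMutants == 0) {
            return 0.0;
        }
        return 100.0 * killedMutants / coveredMutants;
    }
}
```

For instance, a test that covers 25 mutants and kills 14 of them has a Test Strength of 56%, which is computed as `TestStrength.percent(14, 25)`.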