LLM-Based Test Generation from Bytecode
An Empirical Comparison with EvoSuite and the Effect of Example Test Artifacts
J.L. Overmars (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S.S. Chakraborty – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Automated test generation for third-party libraries is essential for detecting behavioral changes introduced by updates. Existing tools like EvoSuite operate on bytecode but achieve only moderate mutation scores, leaving room for improvement. Recent work has shown that LLMs can outperform EvoSuite on mutation score when given source code, but it is unclear whether these results hold when only bytecode is available. Additionally, LLM-generated tests suffer from hallucinations and low compilation rates, and it is unknown whether providing example tests can mitigate these issues. This paper investigates whether LLMs can generate effective test suites from bytecode alone, and whether adding example test artifacts (as source code or decompiled bytecode) improves compilation rates and overall test quality. I compare LLM-generated tests with EvoSuite across five Java libraries, evaluating line, branch, and method coverage, as well as mutation score. The results show that LLMs outperform EvoSuite on coverage for three of the five libraries and achieve higher per-class mutation scores, though they struggle slightly more with very large classes. Adding example tests improves branch coverage for some libraries but decreases method coverage, with little difference between source code and bytecode examples. Test artifacts do not consistently improve compilation rates or overall quality. These findings suggest that LLM-based test generation from bytecode is a viable alternative to EvoSuite, and can even outperform it on certain libraries.