LLM-Based Test Generation from Bytecode

An Empirical Comparison with EvoSuite and the Effect of Example Test Artifacts

Bachelor Thesis (2026)
Author(s)

J.L. Overmars (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S.S. Chakraborty – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated test generation for third-party libraries is essential for detecting behavioral changes introduced by updates. Existing tools such as EvoSuite operate on bytecode but achieve only moderate mutation scores, leaving room for improvement. Recent work has shown that LLMs can outperform EvoSuite on mutation score when given source code, but it is unclear whether these results hold when only bytecode is available. Additionally, LLM-generated tests suffer from hallucinations and low compilation rates, and it is unknown whether providing example tests can mitigate these issues. This paper investigates whether LLMs can generate effective test suites from bytecode alone, and whether adding example test artifacts (as source code or as decompiled bytecode) improves compilation rates and overall test quality. I compare LLM-generated tests with EvoSuite across five Java libraries, evaluating line, branch, and method coverage as well as mutation score. The results show that LLMs outperform EvoSuite on coverage for three of the five libraries and achieve higher per-class mutation scores, though they struggle slightly more with very large classes. Adding example tests improves branch coverage for some libraries but decreases method coverage, with little difference between source-code and bytecode examples. Test artifacts do not consistently improve compilation rates or overall quality. These findings suggest that LLM-based test generation from bytecode is a viable alternative to EvoSuite and can even outperform it on certain libraries.