LLM-Based Test Generation from Bytecode

An Empirical Comparison with EvoSuite and the Effect of Example Test Artifacts

Bachelor Thesis (2026)

Author(s)

J.L. Overmars (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S.S. Chakraborty – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Mutation Testing, Automated Test Generation, LLM, LLM-based Test Generation, Bytecode Representation, Test Examples

To reference this document use

https://resolver.tudelft.nl/uuid:484e28ce-c461-4fe8-ba23-86665c966cf9

More Info

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated test generation for third-party libraries is essential for detecting behavioral changes introduced by updates. Existing tools such as EvoSuite operate on bytecode but achieve only moderate mutation scores, leaving room for improvement. Recent work has shown that LLMs can outperform EvoSuite on mutation score when given source code, but it is unclear whether these results hold when only bytecode is available. In addition, LLM-generated tests suffer from hallucinations and low compilation rates, and it is unknown whether providing example tests can mitigate these issues. This paper investigates whether LLMs can generate effective test suites from bytecode alone, and whether adding example test artifacts (source code or decompiled bytecode) improves compilation rates and overall test quality. I compare LLM-generated tests with EvoSuite across five Java libraries, evaluating line, branch, and method coverage as well as mutation score. The results show that LLMs outperform EvoSuite on coverage for three of the five libraries and achieve higher per-class mutation scores, though they struggle slightly more with very large classes. Adding example tests improves branch coverage for some libraries but decreases method coverage, with little difference between source-code and bytecode examples. Test artifacts did not consistently improve compilation rates or overall quality. These findings suggest that LLM-based test generation from bytecode is a viable alternative to EvoSuite and can even outperform it on certain libraries.
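The abstract reports mutation scores per class and notes that very large classes behave differently. As a minimal sketch of why that distinction matters, the Python snippet below assumes the common definition of mutation score (killed mutants divided by generated mutants) and contrasts an unweighted per-class mean with a pooled score over all mutants. The class names, counts, and function names are hypothetical illustrations; they are not taken from the thesis, and the thesis may define or aggregate the metric differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ClassMutationResult:
    """Hypothetical record: mutant counts for one class under test."""
    class_name: str
    killed: int   # mutants detected by at least one failing test
    total: int    # mutants generated for this class


def mutation_score(killed: int, total: int) -> float:
    """Assumed definition: killed / total, with 0.0 when no mutants exist."""
    if total == 0:
        return 0.0
    return killed / total


def per_class_scores(results: List[ClassMutationResult]) -> Dict[str, float]:
    """Score each class independently."""
    return {r.class_name: mutation_score(r.killed, r.total) for r in results}


def mean_per_class_score(results: List[ClassMutationResult]) -> float:
    """Unweighted mean: every class counts equally, regardless of size."""
    scores = per_class_scores(results)
    return sum(scores.values()) / len(scores) if scores else 0.0


def pooled_score(results: List[ClassMutationResult]) -> float:
    """Pooled score: large classes dominate because they have more mutants."""
    killed = sum(r.killed for r in results)
    total = sum(r.total for r in results)
    return mutation_score(killed, total)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    results = [
        ClassMutationResult("SmallUtil", killed=8, total=10),
        ClassMutationResult("LargeParser", killed=50, total=200),
    ]
    print(per_class_scores(results))
    print(mean_per_class_score(results))
    print(pooled_score(results))
```

With these made-up counts the per-class mean is well above the pooled score, because the one large, weakly tested class carries most of the mutants. A tool that does well on small classes but poorly on large ones can therefore look stronger under per-class averaging than under pooling.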

Files

Research_paper_8_.pdf

(pdf | 0.157 Mb)