LLM-Based Unit Test Generation Without Source Code
An Empirical Evaluation of Bytecode Representations, Prompt Engineering, Model Selection, and Temperature Settings
A.Z. Głodek (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Automated unit test generation is often used to reduce the manual effort required to create and maintain software tests. Recently, Large Language Models (LLMs) have shown promising results in generating unit tests directly from source code; however, existing work assumes that source code is available, which is not always the case. Many third-party libraries are distributed only as compiled artifacts, making source-code-based test generation difficult or impossible. Generating tests from compiled software could help developers evaluate the behavioural compatibility of dependency updates without access to the original codebase, but it remains unclear how well LLMs perform when only bytecode is accessible.
This research investigates LLM-based unit test generation using only bytecode. I developed an automated pipeline that generates, compiles, executes, and evaluates JUnit tests for Java libraries from disassembled and decompiled bytecode. I used this pipeline to study how model choice, representation, prompting, and temperature affect compilation, execution, and coverage. I also evaluated the best configuration using iterative prompting and compared it against EvoSuite.
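As a rough illustration of the experimental setup described above, the sketch below enumerates configurations over the four studied factors and builds a `javap` call for the disassembled representation. The `Config` class, the function names, and every factor value used in the usage note are hypothetical placeholders, not the thesis's actual pipeline or settings; only the `javap` flags (`-c`, `-p`, `-classpath`) are standard JDK options.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    """One experimental configuration (field names are assumptions)."""
    model: str
    representation: str  # e.g. "disassembled" or "decompiled"
    prompting: str       # e.g. "zero-shot" or "few-shot"
    temperature: float


def config_grid(models, representations, promptings, temperatures):
    """Enumerate every combination of the four studied factors."""
    return [
        Config(*combo)
        for combo in product(models, representations, promptings, temperatures)
    ]


def javap_command(classpath, class_name):
    """Build a javap call that prints disassembled bytecode (-c),
    including private members (-p), for one class on the classpath."""
    return ["javap", "-c", "-p", "-classpath", classpath, class_name]
```

For example, two placeholder models, two representations, two prompting styles, and two temperatures yield 16 configurations, each of which would then be run through generation, compilation, execution, and coverage measurement.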
Across 50 Java libraries, the best configuration achieved 89.5% compilation success and 83.6% execution success. Few-shot prompting and higher temperatures produced the strongest single-pass results, while iterative prompting nearly doubled coverage. In the comparison with EvoSuite, the approach produced usable tests for all 19 evaluated libraries, whereas EvoSuite succeeded on only 9, although it achieved higher coverage on the libraries where it succeeded. These results suggest that bytecode-based LLM test generation is promising when source code is unavailable.
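The abstract does not spell out how iterative prompting is implemented. One common form, sketched here under that assumption, feeds the tool feedback from a failed attempt (for example, compiler errors) back into the next prompt. The function name, the callback signatures, the round limit, and the prompt layout are all hypothetical.

```python
from typing import Callable, Optional, Tuple


def iterative_generate(
    generate: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    prompt: str,
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Request a test, check it, and feed failure output into the next prompt.

    `generate` stands in for an LLM call and `check` for compile/execute
    validation; both are assumed interfaces. Returns (test_source, rounds_used),
    where test_source is None if no round produced a passing candidate.
    """
    current_prompt = prompt
    for round_no in range(1, max_rounds + 1):
        candidate = generate(current_prompt)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_no
        # Rebuild the prompt from the original task plus the latest failure.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{candidate}"
            f"\n\nTool feedback:\n{feedback}"
        )
    return None, max_rounds
```

Keeping the original task text in every round, rather than appending to an ever-growing prompt, is a design choice made here to bound prompt length; other strategies are equally plausible.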