LLM-Based Unit Test Generation Without Source Code

An Empirical Evaluation of Bytecode Representations, Prompt Engineering, Model Selection, and Temperature Settings

Bachelor Thesis (2026)

Author(s)

A.Z. Głodek (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.R. Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
22-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated unit test generation is often used to reduce the manual effort required to create and maintain software tests. Recently, Large Language Models (LLMs) have shown promising results in generating unit tests directly from source code; however, existing work assumes that source code is available, which is not always the case. Many third-party libraries are distributed only as compiled artifacts, making source-code-based test generation difficult or impossible. Generating tests from compiled software could help developers evaluate the behavioural compatibility of dependency updates without access to the original codebase, but it remains unclear how well LLMs perform when only bytecode is accessible.

This research investigates LLM-based unit test generation using only bytecode. I developed an automated pipeline that generates, compiles, executes, and evaluates JUnit tests for Java libraries from disassembled and decompiled bytecode. I used this pipeline to study how model choice, representation, prompting, and temperature affect compilation, execution, and coverage. I also evaluated the best configuration using iterative prompting and compared it against EvoSuite.
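
As an illustration only, the two ideas in this paragraph (a grid of configurations over model, representation, prompting, and temperature, and an iterative prompting loop that feeds compiler or runtime errors back to the model) could be sketched as follows. This is a minimal sketch under stated assumptions, not the pipeline from the thesis: it is written in Python for brevity, and every name in it (`Config`, `Outcome`, `config_grid`, `generate_with_feedback`, and the three callbacks) is hypothetical. The callbacks stand in for the real LLM call, the Java compiler, and the JUnit runner.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Config:
    """One experimental configuration (all field values are assumptions)."""
    model: str
    representation: str  # e.g. "disassembled" or "decompiled"
    prompting: str       # e.g. "few-shot"
    temperature: float


@dataclass
class Outcome:
    compiled: bool   # True if at least one attempt compiled
    executed: bool   # True if an attempt compiled and ran without error
    attempts: int    # number of generation rounds used


def config_grid(
    models: Sequence[str],
    representations: Sequence[str],
    promptings: Sequence[str],
    temperatures: Sequence[float],
) -> List[Config]:
    """Enumerate the full Cartesian product of the four studied factors."""
    return [
        Config(m, r, p, t)
        for m, r, p, t in product(models, representations, promptings, temperatures)
    ]


def generate_with_feedback(
    generate: Callable[[Optional[str]], str],
    compile_test: Callable[[str], Optional[str]],
    run_test: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Outcome:
    """Iterative prompting: retry generation, passing the last error back.

    generate(feedback) returns test source; feedback is None on the first round.
    compile_test(source) and run_test(source) return None on success,
    or an error message that becomes the feedback for the next round.
    """
    feedback: Optional[str] = None
    compiled = False
    for attempt in range(1, max_rounds + 1):
        source = generate(feedback)
        feedback = compile_test(source)
        if feedback is not None:
            continue  # compilation failed; retry with the compiler error
        compiled = True
        feedback = run_test(source)
        if feedback is None:
            return Outcome(compiled=True, executed=True, attempts=attempt)
    return Outcome(compiled=compiled, executed=False, attempts=max_rounds)
```

With stub callbacks (for example, a `generate` that only returns compilable source once it has received feedback), the loop can be exercised without any model or compiler, which keeps the control flow easy to check in isolation.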

Across 50 Java libraries, the best configuration achieved 89.5% compilation success and 83.6% execution success. Few-shot prompting and higher temperatures produced the strongest single-pass results, while iterative prompting nearly doubled coverage. In the comparison with EvoSuite, the approach produced usable tests for all 19 evaluated libraries, whereas EvoSuite succeeded on 9 of them, although it achieved higher coverage on the libraries where it did succeed. These results suggest that bytecode-based LLM test generation is promising when source code is unavailable.

Files

AnnaGlodek_ThesisFinal.pdf
(pdf | 0.235 Mb)
License info not available