How does LLM-based test generation for Java libraries perform when full source code is available?

Evaluating LLM-based test generation for libraries across code representations

Bachelor Thesis (2026)
Author(s)

Cao Minh Nguyen Cao Minh (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Cathrine Paulsen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Sebastian Proksch – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
26-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As software projects depend heavily on open-source libraries, developers use tests to ensure that dependency updates remain behaviourally compatible. However, such library tests are often incomplete or unavailable. Automated test generation tools such as EvoSuite exist, and Large Language Models (LLMs) have shown promise in generating more readable tests, but most evaluations have been conducted on benchmark datasets or popular GitHub projects. This leaves a gap in understanding how effective LLM-generated tests are for released library artifacts. In this paper, we evaluate LLM-based test generation for released Java libraries from Maven Central to assess its feasibility in dependency validation workflows. We implement a pipeline that provides source code and method context to a locally hosted LLM, validates the generated tests, and applies iterative repair when needed. Our results show that tests generated by the local model achieve substantially lower coverage than EvoSuite, primarily due to compilation failures, which highlights that symbol resolution errors remain a key challenge in generating tests with LLMs. We further show that iterative repair improves the coverage of generated tests and that a stronger cloud-hosted model even surpasses EvoSuite in coverage. Overall, the findings indicate that LLM-based test generation from source code is a promising approach for dependency update validation when combined with sufficiently capable models and iterative repair mechanisms.
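
The generate, validate, and repair loop described in the abstract can be illustrated with a minimal Java sketch. This is not the thesis's implementation: the class `RepairLoopSketch`, the types `Validation` and `Outcome`, the prompt wording, and the `maxRepairs` parameter are all hypothetical, and the model and validator are passed in as plain functions so the example runs without an LLM or a compiler. A real pipeline would presumably back the validator with compilation and test execution.

```java
import java.util.Optional;
import java.util.function.Function;

public class RepairLoopSketch {

    /** Hypothetical validation result: whether the test passed validation, plus diagnostics. */
    public record Validation(boolean ok, String diagnostics) {}

    /** Hypothetical loop result: the accepted test source (if any) and attempts used. */
    public record Outcome(Optional<String> testSource, int attempts) {}

    /**
     * Asks the model for a test, validates it, and on failure feeds the
     * diagnostics back to the model, up to maxRepairs repair rounds.
     */
    public static Outcome generateWithRepair(
            String methodContext,
            Function<String, String> model,          // assumed: prompt -> test source
            Function<String, Validation> validator,  // assumed: test source -> result
            int maxRepairs) {
        String candidate = model.apply("Write a JUnit test for:\n" + methodContext);
        for (int attempt = 1; ; attempt++) {
            Validation result = validator.apply(candidate);
            if (result.ok()) {
                return new Outcome(Optional.of(candidate), attempt);
            }
            if (attempt > maxRepairs) {
                // Repair budget exhausted: report failure instead of looping forever.
                return new Outcome(Optional.empty(), attempt);
            }
            candidate = model.apply(
                    "Fix this test.\nErrors:\n" + result.diagnostics()
                            + "\nTest:\n" + candidate);
        }
    }

    public static void main(String[] args) {
        // Stub model: the first answer lacks an import, the repaired answer has it.
        Function<String, String> stubModel = prompt ->
                prompt.startsWith("Fix")
                        ? "import org.junit.jupiter.api.Test;\nclass AddTest {}"
                        : "class AddTest {}";
        // Stub validator: stands in for compiling and running the test.
        Function<String, Validation> stubValidator = source ->
                source.contains("import org.junit")
                        ? new Validation(true, "")
                        : new Validation(false, "cannot find symbol: class Test");

        Outcome outcome = generateWithRepair("int add(int a, int b)", stubModel, stubValidator, 2);
        System.out.println("accepted=" + outcome.testSource().isPresent()
                + " attempts=" + outcome.attempts());
    }
}
```

Bounding the number of repair rounds keeps the cost per method predictable, and returning the diagnostics to the model mirrors the observation that symbol resolution errors are the dominant failure mode.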
