Evaluating the Impact of Software Context on the Quality of LLM-Generated Test Oracles

Bachelor Thesis (2026)

Author(s)

H. Klijn (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M.J.G. Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Voulimeneas – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Context Compression, Software Testing, Large Language Model

To reference this document use

https://resolver.tudelft.nl/uuid:c904e72d-3e2b-4521-8d2d-4bd5320234fa

More Info

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Software testing is essential for verifying that a program is correct and behaves as intended. Large Language Models (LLMs) have shown promise in generating effective test oracles, where a test oracle is the mechanism that determines whether the behaviour of a System Under Test (SUT) is correct for a given input. Prior work has shown that the type of context provided to an LLM influences the quality of the generated oracles. However, existing work often evaluates these oracles by comparing them to human-written assertions, which may not fully reflect real-world oracle quality. This paper investigates how different configurations of context types influence the quality of LLM-generated test oracles. We replicate prior work by evaluating eight context configurations using more realistic quantitative quality measures: compilation rate, pass rate, mutation score, and test strength. We then extend this evaluation by investigating whether compressed context retains enough relevant information to generate useful oracles. The results suggest that, among the evaluated context types, including the focal class improves the quality of LLM-generated assertions the most. The effect of Javadoc is mixed: it improves results when the available code context is limited, but its effect is small or even negative when richer code context is already available. Compression methods effectively reduce the number of tokens, but they do not retain the full quality of the generated test oracles. The uncompressed configuration performs best overall. When context size matters, however, the test prefix paired with a summary provides a reasonable trade-off between oracle quality and token usage.
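The four quality measures named in the abstract can be illustrated with a minimal sketch. The abstract does not define them, so the formulas below are assumptions based on common usage: pass rate is taken relative to the tests that compile, mutation score is killed mutants over all generated mutants, and test strength is killed mutants over the mutants covered by the tests. The class `SuiteOutcome`, the function `quality_measures`, and the numbers in the example are hypothetical and do not come from the thesis.

```python
from dataclasses import dataclass


@dataclass
class SuiteOutcome:
    """Hypothetical aggregate counts for one context configuration."""
    generated: int        # tests with an LLM-generated oracle
    compiled: int         # of those, tests that compile
    passed: int           # of the compiled tests, tests that pass
    mutants_total: int    # mutants generated for the SUT
    mutants_covered: int  # mutants executed by at least one test
    mutants_killed: int   # mutants detected by a failing assertion


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def quality_measures(o: SuiteOutcome) -> dict[str, float]:
    """Compute the four measures under the assumed definitions."""
    return {
        "compilation_rate": ratio(o.compiled, o.generated),
        # Assumption: pass rate is measured over compiled tests only.
        "pass_rate": ratio(o.passed, o.compiled),
        # Assumption: mutation score counts all mutants in the denominator.
        "mutation_score": ratio(o.mutants_killed, o.mutants_total),
        # Assumption: test strength counts only covered mutants.
        "test_strength": ratio(o.mutants_killed, o.mutants_covered),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the thesis.
    example = SuiteOutcome(100, 80, 60, 200, 150, 90)
    for name, value in quality_measures(example).items():
        print(f"{name}: {value:.2f}")
```

Under these assumed definitions, test strength is never lower than mutation score, because covered mutants are a subset of all mutants. Reporting both separates how well the assertions detect faults from how much code the test prefix reaches.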

Files

Final_Paper.pdf

(PDF | 0.483 MB)

License info not available