Evaluating the Impact of Software Context on the Quality of LLM-Generated Test Oracles
H. Klijn (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.J.G. Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Voulimeneas – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Testing is essential for verifying that software is correct and behaves as intended. Large Language Models (LLMs) have shown promise in generating effective test oracles, the mechanisms that determine whether a System Under Test (SUT) behaves correctly for a given input. Prior work has shown that the type of context provided to an LLM influences the quality of the generated oracles. However, existing work often evaluates these oracles by comparing them to human-written assertions, which may not fully reflect real-world oracle quality. This paper investigates how different configurations of context types influence the quality of LLM-generated test oracles. We replicate prior work by evaluating eight context configurations using more realistic quantitative quality measures, including compilation rate, pass rate, mutation score, and test strength. Furthermore, we extend this evaluation by investigating whether compressed context can retain enough relevant information to generate useful oracles. The results suggest that, among the evaluated context types, including the focal class improves the quality of LLM-generated assertions the most. The effect of Javadoc is mixed: it improves results when the available code context is limited, but its effect is minor or even negative when richer code context is already available. Compression methods effectively reduce the number of tokens but do not retain the full quality of the generated test oracles, and the uncompressed configuration performs best overall. However, when context size is important, the test prefix paired with a summary provides a reasonable trade-off between oracle quality and token usage.
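As a rough illustration of the four quality measures named in the abstract, the sketch below computes them from aggregate counts for a single context configuration. The abstract does not define the measures, so the formulas are assumptions based on common usage: pass rate is taken over compiled tests, mutation score over all generated mutants, and test strength over covered mutants only. The `OracleRun` type, its field names, and the example counts are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class OracleRun:
    """Aggregate counts for one context configuration (hypothetical fields)."""
    generated: int        # tests that received an LLM-generated oracle
    compiled: int         # of those, tests that compile
    passed: int           # of the compiled tests, tests that pass
    mutants_total: int    # mutants generated for the code under test
    mutants_covered: int  # mutants executed by at least one test
    mutants_killed: int   # mutants detected by a failing test


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def quality_measures(run: OracleRun) -> dict[str, float]:
    """Compute the four measures under the assumed definitions above."""
    return {
        "compilation_rate": ratio(run.compiled, run.generated),
        "pass_rate": ratio(run.passed, run.compiled),
        "mutation_score": ratio(run.mutants_killed, run.mutants_total),
        "test_strength": ratio(run.mutants_killed, run.mutants_covered),
    }


# Made-up counts, purely for illustration.
example = OracleRun(generated=100, compiled=80, passed=60,
                    mutants_total=50, mutants_covered=40, mutants_killed=20)
for name, value in quality_measures(example).items():
    print(f"{name}: {value:.2f}")
```

Under these assumed definitions, test strength is never lower than mutation score, because it ignores mutants that no test executes. Reporting both separates weak assertions from missing coverage.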