M.J.G. Olsthoorn


Testing software is essential for verifying that software is correct and behaves as intended. Large Language Models (LLMs) have shown promise in generating effective test oracles, which are defined as the mechanism used to determine the correctness of the behaviour for a given input to a System Under Test (SUT). Prior work has shown that the type of context provided to an LLM influences the quality of generated oracles. However, existing work often evaluates these oracles by comparing them to human-written assertions, which may not fully reflect real-world oracle quality. This paper investigates how different configurations of context types influence the quality of LLM-generated test oracles. We replicate prior work by evaluating eight context configurations using more realistic quantitative quality measures, including compilation rate, pass rate, mutation score, and test strength. Furthermore, we extend this evaluation by investigating whether compressed context can retain enough relevant information to generate useful oracles. The results suggest that including the focal class improves the quality of LLM-generated assertions the most among the evaluated context types. The effect of Javadoc is mixed: it improves results when available code context is limited. However, its effect is limited or even negative when richer code context is already available. Compression methods effectively reduce the number of tokens, but do not retain the full quality of the generated test oracles. The uncompressed configuration performs best overall. However, when context size is important, the test prefix paired with a summary provides a reasonable trade-off between oracle quality and token usage. ...
Unit test assertions are essential for detecting software faults, yet writing them remains costly and time-consuming. Large Language Models (LLMs) offer a promising way to automate assertion generation. However, prior work has primarily focused on generating assertions that closely mimic human-written ones. Because this represents only one possible generation strategy, the impact of alternative approaches on overall quality remains poorly understood. This paper presents an empirical study evaluating four distinct generation strategies: Assertion Generation, which was proposed and evaluated in prior work, alongside Assertion Augmentation, Blind Augmentation, and Chain-of-Thought Generation. Using GPT-oss 20b as the underlying model, we evaluate these strategies on 811 test oracles from 10 open-source projects in the GitBug-Java benchmark. We assess the generated assertions in terms of correctness, fault-detection capability, and textual similarity to developer-written assertions. Our results show that the choice of generation strategy strongly influences performance. Assertion Augmentation performs best overall, achieving the highest compilation rate, execution validity, and mutation score. Meanwhile, Chain-of-Thought Generation detects the highest proportion of real bugs, and standalone Assertion Generation yields results most similar to developer-written tests. Overall, the findings demonstrate that providing LLMs with existing developer-written assertions substantially improves the quality and effectiveness of generated test oracles. ...
Bachelor thesis (2026) - H. Galitianu, A. Panichella, Mitchell Olsthoorn
Robust test assertions are critical for verifying deep semantic behavior, but their automated generation remains a primary bottleneck in software testing. Automated test case generation approaches often rely on implicit oracles or regression checks that miss semantic failures. Large language models (LLMs) can synthesize meaningful assertions, but single-pass prompting frequently produces uncompilable or failing code. We propose a multi-agent workflow for Java test assertion generation consisting of code comprehension, test objective planning, and assertion generation. The workflow extracts mutation-relevant variable manifests, structures high-level testing plans, compiles and executes the generated test candidates, and iteratively refines assertions using mutation-testing feedback from PITest to optimize mutation quality before final selection.

We evaluate the approach on 112 focal tests from twilio-java and liqp. Compared with static prompting, agentic configurations substantially improve reliability, increasing the percentage of valid runs (compilable, executable, and passing tests) from 58.1% to 84.8%. Relative to the human baseline, the agentic configuration raises the average Test Strength (the ratio of killed mutants to covered mutants) from 45.6% to approximately 56%. Our evaluation shows that while execution feedback significantly improves reliability and observed Test Strength, combining all agentic components does not yield the best computational trade-off. ...

Investigating the Relationship Between Syntactic and Semantic Equivalence in Human and LLM Test Assertions

Test assertions form a critical component of software tests, as they are the component that actually verifies whether the code under test exhibits the desired behaviour. However, writing test assertions is time-consuming, and research has therefore explored how to automate this task. Since the emergence of Large Language Models (LLMs) in recent years, interest in their application for assertion generation has grown. LLMs have shown promise, with LLM-generated assertions achieving mutation scores similar to human-written assertions. However, existing research evaluates the assertions based on either exact matches or mutation scores in isolation, and thus does not investigate the relationship between syntactic and semantic equivalence. This matters because syntactically different assertions can have the same semantics. Penalising the LLM for writing assertEquals(a, b) instead of assertTrue(a.equals(b)) leads to systematically under-reported LLM performance.

In this paper, we investigate the extent to which LLM-generated assertions differ syntactically but remain semantically equivalent to human-written reference assertions. We construct a dataset with 177 filtered entries drawn from open source projects and generate assertions using gpt-oss-20b. We then measure the syntactic similarity via normalised tree edit distance and related metrics. We approximate the semantic similarity based on Jaccard and Ochiai similarity between the sets of mutants killed with PIT mutation testing. We find a moderately strong correlation between normalised tree edit distance and the Jaccard similarity of the killed mutants (ρ = -0.685, p < 0.001), indicating that the two metrics are related but not interchangeable. Open coding of 41 semantically equivalent but syntactically different pairs revealed ten transformation categories. The LLM showed a universal preference for the omission of assertion messages and for replacing boolean checks with equality assertions. Finally, we use a decision tree to generate a threshold allowing us to effectively distinguish between datapoints likely and unlikely to be semantically equivalent. We find this threshold to be 0.41 for the normalised tree edit distance, showing a median Jaccard similarity of 0.5290 below it and a median of 1.000 above it. Our findings suggest that exact match evaluation significantly underestimates LLM assertion generation performance, and that syntactic similarity with a fixed threshold offers a more useful metric for assertion quality. ...
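The two set-based measures named above can be sketched as follows, using the standard Jaccard and Ochiai formulas applied to sets of killed mutants. The mutant identifiers are hypothetical, and treating two empty kill sets as fully similar is an assumption made here for illustration, not something the abstract specifies.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B| over two sets of killed mutants."""
    if not a and not b:
        return 1.0  # assumption: two empty kill sets count as identical
    return len(a & b) / len(a | b)


def ochiai(a: set, b: set) -> float:
    """|A ∩ B| / sqrt(|A| * |B|) over two sets of killed mutants."""
    if not a and not b:
        return 1.0  # same assumption as above
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical mutant IDs killed by the human-written and LLM-generated assertion.
killed_by_human = {"m1", "m2", "m3", "m4"}
killed_by_llm = {"m2", "m3", "m4", "m5"}

print(jaccard(killed_by_human, killed_by_llm))  # 3 shared / 5 total = 0.6
print(ochiai(killed_by_human, killed_by_llm))   # 3 / sqrt(4 * 4) = 0.75
```

Both measures reach 1.0 only when the two assertions kill exactly the same mutants, which is the sense in which they approximate semantic equivalence here.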

Evaluating Fine-Tuned CodeT5 Models on Assertion Generation Quality and Efficiency

Testing is a core practice in software development for detecting faults and checking that code behaves as expected. With the recent advent of Large Language Models (LLMs), code generation has never been more widespread. In assertion generation, where the focus is on the oracles that assess the state of the program, fine-tuned code language models have emerged. One such model, AsserT5, is a CodeT5-large (770M parameters) fine-tuned on focal-method and test-method pairs. Although it achieves state-of-the-art performance when measured by exact match to the ground truth, it remains unclear how the top-1 predictions of the smaller variants (CodeT5-small, 60M; CodeT5-base, 220M) perform on mutation score when the same fine-tuning procedure is applied.

Across ten real-world Java projects and 541 assertion-generation tasks, we find that the fine-tuned 60M CodeT5-small matches the 220M and 770M variants on mutation score (within 0.2 p.p.), achieving the highest score of the three by generating more assertions that compile. Among the larger code-specific baselines (Qwen2.5-Coder 3B, 7B, and 14B), CodeT5-small underperforms only the 14B model, and only by 0.6 p.p. This advantage is concentrated in just two of the ten projects, and the 14B model attains it at the cost of 38x more memory (9.00 GB vs 0.24 GB) and 2.6x slower inference. Because the difference is small and confined to two out of ten projects, we recommend the fine-tuned CodeT5-small to practitioners seeking local assertion-generation assistance at reasonable computational cost. ...
Master thesis (2026) - L. Negru, M.J.G. Olsthoorn, A. Panichella, C. Lofi
Automatic test case generation for dynamically typed languages such as JavaScript is significantly hindered by the absence of explicit type information, which expands the search space for search-based testing and reduces its effectiveness. While prior probabilistic and neural type inference methods address this, they struggle with complex user-defined types, higher-order functions, and external package dependencies. This paper presents and evaluates three LLM-based approaches for type inference in JavaScript. The primary contribution is a Retrieval-Augmented Generation (RAG) approach that constructs a vector database of semantically rich code embeddings. These embeddings include ASTs, program slices, and code annotations. This enables efficient, project-wide context retrieval paired with Chain-of-Thought prompting. In a large-scale empirical evaluation against the SynTest framework, the RAG approach achieves a 29% average accuracy improvement over non-RAG LLM approaches, an 85% reduction in computation time, and a 63% accuracy improvement over probabilistic inference for deep, user-defined types. For primitive types, probabilistic methods remain competitive. These findings motivate future hybrid strategies combining probabilistic and LLM-based inference.

https://doi.org/10.5281/zenodo.19496755
Replication package of "Retrieval First: LLM-Assisted Type Inference for Automatic Test Case Generation in JavaScript" ...

An Empirical Evaluation of LLM Architectures on the AICrypto Benchmark

Bachelor thesis (2026) - Q.W. Voet, Z. Erkin, M.J.G. Olsthoorn
Large Language Models (LLMs) demonstrate gold-medal performance in pure mathematics but continue to struggle in professional Capture-The-Flag (CTF) cybersecurity competitions, where the goal is to obtain a flag string as proof. While models can solve textbook equations, the iterative engineering required for cryptographic challenges often exceeds their single-pass capabilities. We explore whether this performance gap is a limitation of the base models or if it can be bridged by using different agent architectures. This paper presents a systematic evaluation of five distinct control flow architectures (i.e., how the agent plans, uses tools, and iterates) applied to the AICrypto CTF benchmark: Chain-Of-Thought (CoT) as the single-pass baseline, ReAct as a reactive method, and three planning/search approaches (ReWOO, ADaPT, and LATS). To isolate the impact of reasoning structures, we restrict the scope to static challenges, which are puzzles that do not require network interaction. Using a fixed base model and tool stack (SageMath execution, command execution and flag submission), we analyze the trade-offs between success rate, time-to-solve, and token consumption to determine if the computational overhead of different architectures contributes to the solving capability. Overall, agent architectures yield a gain of only 5.4 percentage points in success rate over the single-pass baseline (35.1% vs. 29.7% success rate), primarily by iteratively debugging solutions rather than discovering new cryptographic ideas. The higher success rates from complex planning/search architectures come at the expense of higher token and time costs, with ADaPT consuming 93% more tokens on average compared to the baseline in successful challenges. ...
Master thesis (2025) - S.R. Sunnevudóttir, A. van Deursen, M.J.G. Olsthoorn, Pouria Derakhshanfar, M.A. Costea
Automated test generation is a critical area of research in software engineering, aiming to reduce manual effort while improving software reliability. While substantial work has focused on statically typed languages, dynamically typed languages such as JavaScript remain underexplored despite their widespread use and unique challenges. This thesis investigates the current status of JavaScript test generation by systematically evaluating state-of-the-art search-based and large language model-based tools.

We first analyze existing benchmarks to assess their coverage of representative language features, identifying gaps that limit the ability to fairly compare tool performance. We then construct a curated dataset of real-world JavaScript projects and evaluate the LLM-based tool TestPilot and the search-based tool SynTest using a combination of quantitative metrics (e.g., code coverage, pass rates) and feature-based correlation analysis. Our results reveal that TestPilot tends to generate higher coverage (median 27.9% vs 11.2% branch coverage) and more readable tests but produces a larger number of failing or low-value test cases, while SynTest generates more stable and focused test suites yet can struggle with complex or dynamic code constructs. Our similarity analysis shows that each approach achieved unique coverage, suggesting complementary strengths.

This study highlights the need for standardized, language-aware benchmarks and introduces a curated dataset and evaluation framework for evaluating JavaScript test generation tools. By systematically comparing search-based and LLM-based approaches, this thesis offers insights into their respective strengths, limitations, and opportunities for hybrid strategies, advancing the state of automated testing for dynamically typed languages. ...
Search-Based Software Testing (SBST) tools can automatically generate tests to achieve high code coverage; however, a systematic understanding of why they fail in specific situations is necessary. This thesis addresses this gap by developing a comprehensive taxonomy of coverage failures through an empirical analysis of the three most prominent SBST tools: Pynguin (Python), SynTest (JavaScript), and EvoSuite (Java). By classifying and analysing failure patterns across these tools and language paradigms, this research provides a foundational framework to diagnose shortcomings, prioritise future development, and enhance the practical effectiveness of automated test generation. ...
Software testing is crucial to the quality of the final product, yet manual creation of test assertions has become a significant bottleneck in the development process, delaying releases. Large Language Models (LLMs) have shown promise in generating assertions automatically. This is due to their fluency in both natural languages and code, as well as the fact that they produce tests much faster than a developer would. However, LLMs must reckon with deployment issues that come with the high computation time and latency of large models, or the limited functionality of their smaller, locally executable counterparts. Knowledge distillation, a technique that aims to "transfer knowledge" from a teacher model to a student model, can thus unlock the potential of smaller and faster models. This motivates our research into the effectiveness of knowledge distillation for developing a smaller, more efficient model for assertion generation. With CodeT5 as the teacher model, the student model is trained iteratively over epochs and validated on unseen data. The evaluation metrics include assertion accuracy, similarity to the teacher model's output and to the ground truth, model size, and inference time, with the goal of quantifying the trade-offs and determining the feasibility of distilled models for practical assertion generation. We present and analyze the results. The student reached roughly one third of the teacher's capability, which suggests potential for creating efficient yet reliable assertion generation tools. ...
The decentralized nature of blockchain systems makes them prone to concurrency bugs, which are difficult to detect. Testing techniques exist to find these bugs, such as systematic exploration of the solution space, but these techniques are difficult to scale. Evolutionary algorithms have been proposed as an effective solution to find these bugs. In this research, we aim to discover the influence of evolutionary operators on the bug detection performance of evolutionary algorithms. We test this on the XRP Ledger Consensus Protocol (XRP LCP) using priority-based event representation. We present Groot, an evolutionary algorithm that is implemented using a modified version of the Rocket framework. We experimented with two combinations of operators: the Simulated Binary Crossover (SBX) operator with the Gaussian mutation operator and the Laplace Crossover (LX) operator with the Makinen, Periaux and Toivanen Mutation (MPTM) operator. We evaluated these setups in terms of effectiveness and efficiency and compared them to a random baseline. We ran the experiments on a bug-seeded version of the XRP LCP. We discovered that all setups are capable of detecting bugs in the XRP LCP. The results indicate that effectiveness and efficiency are not significantly influenced by the choice of these operators. Possible reasons for these findings include noise in the fitness function, limitations of the event representation, and configuration choices. ...
The XRP Ledger (XRPL) relies on a Byzantine fault-tolerant consensus algorithm to ensure global agreement on transactions across distributed nodes. Despite its critical financial role, the implementation remains under-tested. While prior work has shown the potential of evolutionary testing to uncover potential consensus violations in XRPL, the role of genetic operator selection in this process remains unexplored. We address this research gap by presenting a comparative evaluation of four evolutionary configurations that differ in their balance of exploration and exploitation. The system is tested by injecting network delays to simulate adverse conditions and trigger violations. Our results show that the balance of exploration and exploitation affects the performance of bug detection: configurations that favor exploitation, complemented by subtle exploration, yield the most favorable results. In addition, we contribute an extensible testing method tailored to XRPL but applicable to other distributed systems. ...
Effective test assertions are important for software quality, but their creation is time-consuming. While Large Language Models (LLMs) show promise in automated assertion generation, their size, cost, resource demands, and need for an online connection often render them impractical for widespread developer use. Knowledge Distillation (KD) offers a way to bridge that gap by transferring capabilities from a large "teacher" LLM to smaller "student" models (SLMs). However, the majority of the groundwork on KD has focused on classification tasks and not on generative problems. This paper investigates the feasibility of test assertion generation using response-based KD from a CodeT5-base teacher. We specifically explore the impact of three parameters on assertion quality and model efficiency: student model size (number of layers), pretraining initialization, and loss weighting. Our results demonstrate that distilled small student models (231 MB), particularly those initialized from pretrained checkpoints and fine-tuned with a specific loss weight (α = 0.5) for the ground truth and distillation losses, can retain a significant portion of the teacher's assertion generation performance when considering the defined metrics, achieving around 83.9% of the CodeBERTScore of the teacher with just 25.9% of the size. This work provides empirical insights into creating specialized SLMs for test assertion generation, highlighting practical configurations for deployment in development environments. ...
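The loss weighting mentioned above can be sketched for a single token position as a weighted sum of a ground-truth loss and a distillation loss. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say which term α multiplies, whether a temperature is used, or which divergence is applied. Here α weights the hard-label cross-entropy, (1 - α) weights a KL divergence between temperature-softened teacher and student distributions, and all function names and logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Hard-label loss against the ground-truth token."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=2.0):
    """alpha * ground-truth loss + (1 - alpha) * soft-label loss (assumed form)."""
    hard = cross_entropy(student_logits, target_index)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * soft


# Hypothetical logits over a four-token vocabulary; index 0 is the ground truth.
student = [1.0, 0.5, -0.5, 0.0]
teacher = [2.5, 0.3, -1.0, 0.1]
print(distillation_loss(student, teacher, target_index=0))
```

With α = 0.5 both signals contribute equally; α = 1 reduces to ordinary fine-tuning on the ground truth, and α = 0 trains the student purely to imitate the teacher's output distribution.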

Evaluating Fitness Functions in Priority-Based Evolutionary Testing for the XRP Ledger Consensus Protocol

The XRP Ledger Consensus Protocol is a Byzantine fault-tolerant algorithm that enables the XRP Ledger to reach agreement on which transactions to apply, supporting millions of transactions daily. While the protocol is correct by design, its practical implementation is vulnerable to concurrency-related bugs triggered by nondeterministic message delivery between distributed validator nodes. These bugs are subtle and difficult to expose through conventional testing. In this paper, we investigate the use of evolutionary concurrency testing combined with a priority-based message scheduling strategy to explore different message interleavings. Specifically, we evaluate multiple fitness functions and assess their ability to guide the search toward buggy executions. Our results show that EvoPriority, which applies evolutionary testing to priority-based schedules, is difficult to guide toward buggy executions regardless of the fitness function used. Although it is capable of uncovering violations on bug-seeded versions of the XRP Ledger Consensus Protocol, its performance is similar to randomized concurrency testing. ...

Evaluating Fitness Functions for Concurrency Testing on the XRPL Consensus Protocol

Distributed systems, such as blockchains, can have bugs around edge cases that are hard to detect or trigger. Previous publications have introduced guided-search testing approaches that are able to find edge cases more efficiently than a systematic and exhaustive search. In this paper, we compare the effectiveness of fitness functions in evolutionary testing frameworks. For this, we evaluate time and proposal fitness. While evolutionary testing frameworks are not new to the domain of concurrency testing of consensus algorithms, the impact of the fitness functions that underpin them remains poorly understood. We use the XRPL consensus algorithm as a case study to evaluate the fitness functions using the Rocket testing framework. For this, we make use of seeded versions of XRPL. All evaluated fitness functions were able to detect the bugs we seeded in the source code of the XRPL consensus algorithm. We show the validity of various fitness functions in trying to find the bug and analyze effects in the interplay between the time and proposal fitness functions we examine. ...

Combining response-based distillation and architectural tuning to deliver near-teacher quality on resource-constrained devices

Writing clear, semantically rich test assertions remains a major bottleneck in software development. While large pre-trained models such as CodeT5 excel at synthesizing assertions, their size and latency make them impractical for on-premise or resource-constrained workflows. In this work, we introduce a knowledge-distillation pipeline that transfers knowledge from CodeT5-base, a pre-trained encoder–decoder Transformer model based on the T5 architecture, into sub-1 GB student models tailored specifically for test-assertion generation. Our pipeline combines response-based distillation using soft labels and hard-label fine-tuning, and incorporates custom student architectures comparing pre-trained models vs random initialization, along with targeted regularization techniques. We instantiate students at various size points and conduct an empirical evaluation on standard assertion benchmarks, measuring exact-match accuracy, similarity, RAM footprint, and CPU and GPU inference latency. Our best 230 MB student retains over 80% of the teacher’s assertion-generation accuracy on exact matches and over 90% of the similarity, while having an inference time of under 3 seconds on a single consumer-grade CPU, with a 75% reduction in RAM usage. These results demonstrate that distilled code-LLMs can deliver near-teacher assertion quality under tight memory and latency constraints, paving the way for fully on-device IDE integration and low-overhead continuous-integration workflows. ...

Distilling CodeT5+ for Reduced Model Size and High Accuracy

Bachelor thesis (2025) - D. Wu, M.J.G. Olsthoorn, A. Panichella, P. Kellnhofer
Effective software testing relies on the quality and correctness of test assertions. Recent Large Language Models (LLMs), such as CodeT5+, have shown significant promise in automating assertion generation tasks; however, their substantial computational resource demands limit their practical deployment in common development environments like laptops or local IDEs. To address this challenge, this work explores knowledge distillation to derive smaller, more efficient student models from a larger, pre-trained CodeT5+ teacher. While knowledge distillation has been successfully applied to general code models, its specific application to creating lightweight, locally-deployable models for test assertion generation remains a recognized research gap. Using a dataset that includes assertion input-output pairs and teacher logits, we systematically investigate the impact of different distillation loss components—soft logits loss and hard target losses—on student performance. Our findings demonstrate the practical viability of this approach: a distilled 220M parameter student model can be nearly 3x faster and consume over 40% less memory than its 770M teacher, while retaining approximately 78% of the original’s assertion generation quality as measured by CodeBLEU. These results offer practical insights and a clear pathway for deploying efficient yet effective assertion-generation models suitable for local developer workflows. ...
Writing test cases is an important yet complex task. Search-Based Software Testing (SBST) is an automated test case generation technique that aims to help developers by creating high-coverage test cases. Despite its strengths, a major limitation of this technique is that it often struggles to generate test cases that contain complex method sequences, as it has no semantic understanding of which methods are related. The recent advancement of Large Language Models (LLMs) offers a potential solution due to their natural language capabilities and applicability to software engineering tasks. However, LLMs often achieve lower coverage than SBST in direct comparisons because they lack the output diversity that software testing requires. This opens an opportunity to combine the exploratory power of SBST with the semantic understanding of LLMs to make the test case generation process more effective in terms of test coverage and test structure. This thesis investigates the combination of these methodologies with our hybrid approach, LLM-Seeded Evolutionary Testing (LSET), which uses LLMs to generate tests that contain complex method sequences, and introduces this structure into the SBST process by using these tests as the starting point (seeds) of the algorithm to be evolved further. We conducted an empirical evaluation on a benchmark consisting of 35 JavaScript classes and found a significant branch coverage increase on 19 and 14 classes when compared to SBST and LLM-only baselines, respectively. However, when taking the combination of final SBST and LLM test suites, this gap reduces to 1 out of 35 classes. Beyond coverage, we also found a positive effect on structure when compared to tests generated by SBST, as the structure provided by the LLM tests can be further evolved to reach deeper branches while maintaining readability. ...
Blockchain systems rely on consensus protocols to ensure agreement among nodes even in the presence of malicious or faulty nodes. A consensus protocol that provides safety and liveness guarantees under such conditions is known as a Byzantine fault‑tolerant (BFT) protocol. Various whitepapers describe the design and implementation of BFT protocols, providing formal proofs of their safety and liveness properties. However, in practice such protocols are difficult to implement correctly and often contain subtle logic errors that cause consensus violations. Ensuring correctness is especially important for the XRP Ledger—a public, decentralized blockchain that processes transactions for the widely used cryptocurrency XRP. Thorough testing of consensus protocols is crucial, but systematic testing is challenging and time-consuming due to the large number of possible network and timing configurations.
In this paper, we present a replication package for evaluating randomized Byzantine fault tolerance testing on the XRP Ledger Consensus Protocol (LCP). We implement the ByzzFuzz search algorithm and compare it to naive random testing. Additionally, we investigate the impact of different hyperparameter configurations on the performance of the ByzzFuzz algorithm. Our experimental results demonstrate that both naive random testing and the ByzzFuzz algorithm detect seeded bugs in the XRP LCP, with the ByzzFuzz algorithm uncovering more agreement violations. Finally, we identify the most effective hyperparameter configurations for the ByzzFuzz algorithm. ...
Bachelor thesis (2024) - D.A. Turhan, M.J.G. Olsthoorn, A. Panichella
Software testing is a vital yet time-consuming process during the development lifecycle, often causing engineers to limit its use in practice. In order to encourage active software testing, researchers have made significant advances in automatic unit test case generation with approaches such as search-based testing (e.g., EvoSuite) and large language models (e.g., ChatGPT). However, while the former struggles to explore edge cases of the input space, the latter still suffers from hallucinations during code synthesis, limiting the use of both solutions. This research aims to overcome these limitations by utilizing the strengths of both techniques, which are effective test structure generation and program inference, respectively. In particular, the assertions of initial unit tests generated by EvoSuite are augmented using ChatGPT-4o, with the aim of improving the mutation score, and hence the overall test suite effectiveness. We evaluate our solution, called evoLLve’M, on a benchmark of 20 Java classes from the SourceForge110 Corpus and compare it to only using EvoSuite, which is considered the state-of-the-art approach. Results show that evoLLve’M outperforms EvoSuite in 25% of the classes for mutation score, without negatively impacting other classes. It boosts the total number of killed mutants by 3%, achieving the largest improvements for the increment and null-return mutation types, at 26.9% and 8.9%, respectively. ...