M.J.G. Olsthoorn
46 records found
We evaluate the approach on 112 focal tests from twilio-java and liqp. Compared with static prompting, agentic configurations substantially improve reliability, increasing the percentage of valid runs (compilable, executable, and passing tests) from 58.1% to 84.8%. Relative to the human baseline, the agentic configuration raises the average Test Strength (the ratio of killed mutants to covered mutants) from 45.6% to approximately 56%. Our evaluation shows that while execution feedback significantly improves reliability and observed Test Strength, combining all agentic components does not yield the best computational trade-off.
Beyond the Exact Match
Investigating the Relationship Between Syntactic and Semantic Equivalence in Human and LLM Test Assertions
In this paper, we investigate the extent to which LLM-generated assertions differ syntactically from human-written reference assertions while remaining semantically equivalent to them. We construct a dataset of 177 filtered entries drawn from open-source projects and generate assertions using gpt-oss-20b. We then measure syntactic similarity via normalised tree edit distance and related metrics, and approximate semantic similarity using the Jaccard and Ochiai similarity between the sets of mutants killed under PIT mutation testing. We find a moderately strong correlation between normalised tree edit distance and the Jaccard similarity of the killed mutants (ρ = -0.685, p < 0.001), indicating that the two metrics are related but not interchangeable. Open coding of 41 semantically equivalent but syntactically different pairs revealed ten transformation categories. The LLM showed a universal preference for omitting assertion messages and for replacing boolean checks with equality assertions. Finally, we use a decision tree to derive a threshold that effectively distinguishes datapoints likely to be semantically equivalent from those unlikely to be. We find this threshold to be 0.41 for the normalised tree edit distance, with a median Jaccard similarity of 0.5290 below it and 1.000 above it. Our findings suggest that exact match evaluation significantly underestimates LLM assertion generation performance, and that syntactic similarity with a fixed threshold offers a more useful metric for assertion quality.
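The two set-based similarity measures named in the abstract can be sketched as follows, using their standard definitions: Jaccard is the size of the intersection divided by the size of the union, and Ochiai is the size of the intersection divided by the geometric mean of the two set sizes. The function names, the mutant identifiers, and the handling of empty sets are assumptions for illustration and may differ from what the authors implemented.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        # Assumption: two empty kill sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def ochiai(a: set, b: set) -> float:
    """Ochiai similarity: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a and not b:
        # Assumption: same convention as for Jaccard above.
        return 1.0
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical mutant identifiers killed by each assertion.
human_killed = {"m1", "m2", "m3"}
llm_killed = {"m2", "m3", "m4"}

jaccard_score = jaccard(human_killed, llm_killed)  # 2 shared / 4 total
ochiai_score = ochiai(human_killed, llm_killed)    # 2 / sqrt(3 * 3)
```

Both measures equal 1.0 when the two assertions kill exactly the same mutants, which is how this sketch approximates semantic equivalence.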
Can Small Beat Big?
Evaluating Fine-Tuned CodeT5 Models on Assertion Generation Quality and Efficiency
Across ten real-world Java projects and 541 assertion-generation tasks, we find that the fine-tuned 60M CodeT5-small matches the 220M and 770M variants on mutation score (within 0.2 p.p.), achieving the highest score of the three by generating more assertions that compile. Among the larger code-specific baselines (Qwen2.5-Coder 3B, 7B, and 14B), CodeT5-small underperforms only the 14B model, and only by 0.6 p.p. This advantage is concentrated in just two of the ten projects, and the 14B model attains it at the cost of 38x more memory (9.00 GB vs 0.24 GB) and 2.6x slower inference. Because the difference is small and confined to two out of ten projects, we recommend the fine-tuned CodeT5-small to practitioners seeking local assertion-generation assistance at reasonable computational cost.
https://doi.org/10.5281/zenodo.19496755 Repository link
Replication package of "Retrieval First: LLM-Assisted Type Inference for Automatic Test Case Generation in JavaScript"
Do Agent Architectures Matter for Crypto CTFs?
An Empirical Evaluation of LLM Architectures on the AICrypto Benchmark
We first analyze existing benchmarks to assess their coverage of representative language features, identifying gaps that limit the ability to fairly compare tool performance. We then construct a curated dataset of real-world JavaScript projects and evaluate the LLM-based tool TestPilot and the search-based tool SynTest using a combination of quantitative metrics (e.g., code coverage, pass rates) and feature-based correlation analysis. Our results reveal that TestPilot tends to generate higher coverage (median 27.9% vs 11.2% branch coverage) and more readable tests but produces a larger number of failing or low-value test cases, while SynTest generates more stable and focused test suites yet can struggle with complex or dynamic code constructs. Our similarity analysis shows that each approach achieved unique coverage, suggesting complementary strengths.
This study highlights the need for standardized, language-aware benchmarks and introduces a curated dataset and evaluation framework for evaluating JavaScript test generation tools. By systematically comparing search-based and LLM-based approaches, this thesis offers insights into their respective strengths, limitations, and opportunities for hybrid strategies, advancing the state of automated testing for dynamically typed languages.
EvoPriority
Evaluating Fitness Functions in Priority-Based Evolutionary Testing for the XRP Ledger Consensus Protocol
Survival of the Fittest
Evaluating Fitness Functions for Concurrency Testing on the XRPL Consensus Protocol
Distilling CodeT5 for Efficient On-Device Test-Assertion Generation
Combining response-based distillation and architectural tuning to deliver near-teacher quality on resource-constrained devices
Efficient Local Test Assertion Generation
Distilling CodeT5+ for Reduced Model Size and High Accuracy
In this paper, we present a replication package for evaluating randomized Byzantine fault tolerance testing on the XRP Ledger Consensus Protocol (LCP). We implement the ByzzFuzz search algorithm and compare it to naive random testing. Additionally, we investigate the impact of different hyperparameter configurations on the performance of the ByzzFuzz algorithm. Our experimental results demonstrate that both naive random testing and the ByzzFuzz algorithm detect seeded bugs in the XRP LCP, with the ByzzFuzz algorithm uncovering more agreement violations. Finally, we identify the most effective hyperparameter configurations for the ByzzFuzz algorithm.