A.E. Zaidman
45 records found
We curated a dataset of open-source systems, detected smells with tsDetect, and manually validated refactorable instances. For each instance, we applied literature-backed, smell-specific refactorings and measured energy with EnergiBridge under controlled conditions. The results show that energy effects are smell-specific. Removing Ignored test instances yields clear energy savings, whereas refactoring the Lazy test (JUnit 5) smell via @ParameterizedTest incurs substantial energy increases. Most other smells exhibit small or inconsistent changes. Within our evaluation context (a controlled, CPU-bound, sequential JUnit setting), changes in energy were also strongly coupled with changes in execution time.
Overall, this study extends test smell research into software sustainability and highlights trade-offs between maintainability and energy efficiency. It provides a reproducible measurement pipeline and empirical guidance on when refactoring test smells is likely to be energy-beneficial.
With Great Power Comes Great Responsibility
A Tool for Energy-Aware Java Development
This thesis investigates maintainer reception of agentic AI pull requests by actively submitting 90 pull requests to 45 open-source repositories across Python, TypeScript, and Java, targeting good first issues: tasks traditionally reserved for newcomers making their first contribution to a project. Contributions are structured along two dimensions: whether the repository has explicitly configured agentic AI tooling in its development workflow, and whether the use of AI assistance is disclosed in the pull request. This yields three contribution types, covering disclosed and undisclosed submissions to repositories without explicit AI configuration and disclosed submissions to repositories that have integrated agentic AI tooling. A mixed-methods approach is applied, combining quantitative analysis of acceptance rates and review activity with a qualitative thematic analysis of maintainer feedback.
The results show that acceptance rates differed across contribution types: repositories that had explicitly integrated agentic AI tooling showed a statistically significantly lower acceptance rate than standard repositories with disclosed AI assistance, though not relative to the undisclosed group. Across all groups, staleness accounted for the majority of non-merged pull requests, suggesting that non-engagement was a more common outcome than active rejection. Disclosing AI assistance made no meaningful difference to acceptance rates within the same repository context. No statistically significant differences were found in the volume of reviews or comments across groups, although automated bots contributed a notable share of interactions, particularly in repositories with agentic tooling integration. Thematic analysis of maintainer feedback showed that code quality and implementation correctness were the dominant concerns across all groups, while explicit distrust of AI-generated contributions remained low. When maintainers did reject contributions on AI-related grounds, the concern was typically the degree of human oversight behind the submission rather than AI use itself. Several repositories also introduced or revised AI policies during the contribution period, reflecting how actively norms in this space are still evolving.
Using LLM-Generated Summarizations to Improve the Understandability of Generated Unit Tests
Enhancing Unit Test Understandability: An Evaluation of LLM-Generated Summaries
Leveraging E2E Test Context for LLM-Enhanced Test Data and Descriptions
Enhancing Automated Software Testing with Runtime Data Integration
We conducted a comparative user study with 9 participants using UTGen+, original UTGen, and conventional SBST (EvoSuite), focusing on the effects of trace log inclusion on the naturalness and relevancy of comments, identifiers, and test data across several projects. The results indicated that while UTGen+ did not improve the naturalness and relevancy of comments and identifiers, it significantly enhanced the relevancy of test data. These findings suggest that incorporating contextual data can indeed benefit the generation of more relevant and understandable automated test cases.
Reducing LLM Hallucinations with Retrieval Prompt Engineering
Minimising the Need for Re-prompting in Automatic Understandable Test Generation
The current hallucination handling of UTGen is time-consuming and resource-intensive. To address this, we propose two alternative approaches that use information retrieval prompt engineering techniques to minimise hallucinations. Our techniques add, respectively, the source code under test and the errors thrown by the latest generated test case to the LLM prompt. We assess our methods through a comparison study against the base UTGen version. We observe that source code retrieval enhances the generation of compilable test cases for complex classes. Error code retrieval shows similar hallucination performance to base UTGen, with a decrease in the number of re-prompts for classes with a high normalised Lack of Cohesion of Methods (*LCOM).
Index Terms - Automated Test Generation, Large Language Models (LLMs), LLM Hallucination, Prompt Engineering
Exploring Test Suite Coverage of Large Language Model–Enhanced Unit Test Generation
A Study on the Ability of Large Language Models to Improve the Understandability of Generated Unit Tests Without Compromising Coverage
To streamline bug detection by developers, we propose UTGenCov, a concept that focuses on improving the understandability of EvoSuite-generated tests without compromising on coverage. This approach builds upon UTGen by thoroughly analyzing the reasons behind the decrease in coverage and proposing an alternative approach.
Our investigation determined that the leading cause of coverage reduction in UTGen is LLM hallucination in the Understandability phase. UTGenCov aims to address hallucinations by providing the LLM with the source code of the methods used in the test. Yet our experimental results indicate inconsistent performance and a further decrease in branch coverage of 0.74% compared to UTGen.
Readability Driven Test Selection
Using Large Language Models to Assign Readability Scores and Rank Auto-Generated Unit Tests
Language workbenches are tools for developing and deploying DSLs. They aim to reduce the investment that is required for DSLs and to improve the usability of the created DSLs. By lowering the investment, language workbenches can improve the opportunity for DSLs to be effective. Although much academic work has been published about the underlying technology and concepts of language workbenches, there is little empirical evidence on the actual impact of language workbenches in practice.
In this dissertation, we contribute such empirical evidence on the creation and evaluation of DSLs that are developed with language workbenches. We do so by conducting case studies in an industrial setting. This is important, as such empirical evidence can help others to determine whether to adopt DSLs developed with language workbenches. In particular, we use and evaluate Spoofax, a language workbench developed at the Delft University of Technology.
The context of our work is Canon Production Printing, a digital printing systems manufacturing company. Canon Production Printing provides a good environment for evaluating DSLs, as it has acquired extensive domain knowledge for complex domains like modeling behavior, performance, and physical aspects of printing systems. We develop and evaluate DSLs for two such domains. First, we develop CSX, a new DSL for the domain of configuration spaces of digital printing systems. Second, we reimplement OIL, an existing DSL for control software based on state machines. In both cases, we compare the newly created DSL with the existing situation.
For both cases, we draw generally positive conclusions. For example, in the CSX project, the DSL enables the use of constraint-solving technology to support automatic and accurate configuration of printing systems, which can ultimately improve the quality, performance, and usability of printing systems. In the OIL project, we found that Spoofax is more than adequate for developing a complex DSL with industrial requirements, and we found indications that developing a DSL with Spoofax is more productive than using a GPL and available libraries.
Our extensive case studies at Canon Production Printing have taught us valuable lessons and insights. In particular, to make good on the promise of DSLs in industry, language workbenches need to improve in terms of the non-functional aspects. We expect that improving on, e.g., portability, usability, and documentation will improve the impact of Spoofax on industrial DSL development.
Augmenting Program Synthesis with Large Language Models
Incorporating Natural Language Understanding for Efficient Program Synthesis
The study aims to accumulate knowledge from real-life experiences shared on Stack Overflow and bridge the knowledge gap between industry practices and teaching practices.
The paper explores different types of software testing, popular frameworks, temporal trends of testing-related technologies, controversial opinions, and practices and advice recommended in Stack Overflow posts. The methodology involves determining search terms through literature, querying the Stack Exchange API, conducting frequency analysis of words from posts, and manually inspecting threads. Our results show that the most popular frameworks discussed are Selenium, Spring, JMeter, and React. Automated testing and JavaScript frameworks have shown an upward trajectory over the years. The recommendations made by practitioners were categorized based on the broad scope of topics covered. We draw comparisons and parallels with related previous research and discuss the technical limitations faced during the study.
Overall, this paper uncovers valuable insights from Stack Overflow and provides practitioners with the current view on industry practices.
Topic Analysis on Popular Software Testing Books
Mining Software Testing Knowledge