
A.E. Zaidman


45 records found

Master thesis (2026) - S. Biennier, A.E. Zaidman
Automated testing is essential for software reliability, yet test code frequently contains test smells that degrade maintainability. Prior research has largely examined these issues through software-quality perspectives, leaving their environmental impact underexplored. This study investigates how refactoring test smells affects energy consumption, execution time, and test quality in Java JUnit suites.

We curated a dataset of open-source systems, detected smells with tsDetect and manually validated refactorable instances. For each instance, we applied literature-backed, smell-specific refactorings and measured energy with EnergiBridge under controlled conditions. The results show that energy effects are smell-specific. Removing Ignored test instances yields clear energy savings, whereas refactoring the Lazy test (JUnit 5) smell via @ParameterizedTest incurs substantial energy increases. Most other smells exhibit small or inconsistent changes. Interestingly, we found that changes in energy were also strongly coupled with changes in execution time, within our evaluation context (a controlled, CPU-bound, sequential JUnit setting).

Overall, this study extends test smell research into software sustainability and highlights trade-offs between maintainability and energy efficiency. It provides a reproducible measurement pipeline and empirical guidance on when refactoring test smells is likely to be energy-beneficial. ...
Master thesis (2026) - V.Y. Ning, A.E. Zaidman, Miroslav Zivkovic, M.A. Costea, Z. Erkin
Automated Static Analysis Tools (ASATs) generate a massive volume of non-actionable warnings. To address this, this thesis investigates the performance and resource trade-offs between classical Machine Learning (ML) models and Large Language Models (LLMs) for generating actionability probability scores. Utilizing the NASCAR dataset of over 1.2 million Java warnings, we evaluate optimized classical models (Random Forest and Logistic Regression) against the Claude 4.x LLM family using classification metrics (F1-score, AUC) and probabilistic calibration (Brier scores), supplemented by a qualitative user study of 15 industry professionals. Empirical results demonstrate that an optimized Random Forest yields superior predictive performance (F1-score: 76.85%, AUC: 0.87) and reliable uncertainty calibration (Brier score: 0.1549), rendering the massive computational overhead of miscalibrated LLMs unnecessary. However, the user study identifies a human-AI feature disconnect: while the Random Forest relies heavily on historical metadata, developers universally demand source code context and severity indicators. Ultimately, an optimized Random Forest provides a significantly more efficient framework for scoring ASAT warnings, provided the scores are tightly coupled with the structural evidence required to sustain human trust. ...
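
For readers unfamiliar with the calibration metric mentioned in the abstract above, the Brier score is the mean squared difference between predicted probabilities and the observed 0/1 outcomes, so lower is better. The following is a minimal sketch of that standard definition only; the function name and the sample values are illustrative assumptions and are not taken from the thesis.

```python
def brier_score(outcomes, probabilities):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    `outcomes` holds 1 for an actionable warning and 0 otherwise;
    `probabilities` holds the model's predicted actionability scores.
    Both names are hypothetical and used for illustration only.
    """
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have the same length")
    if not outcomes:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - y) ** 2 for y, p in zip(outcomes, probabilities)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only: a model that always predicts 0.5 scores 0.25.
uninformative = brier_score([1, 0], [0.5, 0.5])
```

A score of 0 would mean every probability matched its outcome exactly, which is why a well-calibrated model is expected to score well below the 0.25 of a constant 0.5 prediction.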

A Tool for Energy-Aware Java Development

Master thesis (2026) - E. Mihalache, A.E. Zaidman, X. Liu, J. Yang
As the energy consumption of the ICT sector continues to grow, there is an increasing need for developers to reason about the energy efficiency of their code. However, most energy measurement tools operate at the application level and require significant workflow disruption, leaving developers without accessible, in-situ feedback during development. In this thesis, we investigate the technical and practical feasibility of a lightweight, software-based energy measurement tool for Java code snippets, implemented as an IntelliJ IDEA plugin backed by a JShell execution environment and the Linux powercap framework. To evaluate the proposed tool, we conduct a two-phase study. In a verification phase spanning 30 algorithmic problem pairs and 1800 measurements, the tool detects statistically significant energy consumption differences in 83.3% of cases. In a mixed-methods validation study with 22 participants, accuracy in identifying the more energy-efficient implementation rises from 56.8% when relying on source code inspection alone, to 97.7% when using the tool, alongside an increase in participant confidence. Qualitative analysis further reveals that the tool assists in correcting flawed intuitions and provides educational value. These results suggest that fine-grained, in-IDE energy measurement is both technically achievable and empirically beneficial, and constitutes a concrete step toward making energy-aware development a routine part of software engineering practice. ...
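
As a rough illustration of what software-based measurement through the Linux powercap framework can look like, the sketch below reads a cumulative energy counter before and after a workload and handles a single counter wraparound. It is not the plugin described above: the zone path `intel-rapl:0`, the function names, and the single-wraparound assumption are all illustrative choices, and reading the counter may require elevated permissions on some systems.

```python
from pathlib import Path

# Assumed powercap zone; the actual zone name depends on the machine.
DEFAULT_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Microjoules consumed between two readings, allowing one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter wrapped past its maximum range and restarted near zero.
    return (max_range_uj - start_uj) + end_uj


def read_counter(zone, name):
    """Read an integer attribute such as energy_uj from a powercap zone."""
    return int((zone / name).read_text().strip())


def measure_joules(workload, zone=DEFAULT_ZONE):
    """Run `workload()` and return the energy it used in joules (sketch)."""
    max_range = read_counter(zone, "max_energy_range_uj")
    start = read_counter(zone, "energy_uj")
    workload()
    end = read_counter(zone, "energy_uj")
    return energy_delta_uj(start, end, max_range) / 1_000_000
```

Because a single reading is noisy, a measurement of this kind is normally repeated many times per snippet, which is consistent with the 1800 measurements reported for the verification phase.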
Master thesis (2026) - T. Sabău, A.E. Zaidman, B.A. Ardıç, J.G.H. Cockx
The integration of agentic artificial intelligence into software development workflows has introduced a new class of challenges for open-source software communities. As autonomous AI systems become capable of independently planning, implementing, and submitting code contributions, maintainers must deal with pull requests whose origin is not always disclosed and whose quality may not reflect sufficient human oversight. Despite growing community friction around this shift, evidenced by explicit AI contribution policies, controlled empirical studies of how maintainers actually respond to agentic AI contributions remain scarce.

This thesis investigates maintainer reception of agentic AI pull requests by actively submitting 90 pull requests to 45 open-source repositories across Python, TypeScript, and Java, targeting good first issues: tasks traditionally reserved for newcomers making their first contribution to a project. Contributions are structured along two dimensions: whether the repository has explicitly configured agentic AI tooling in its development workflow, and whether the use of AI assistance is disclosed in the pull request. This yields three contribution types, covering disclosed and undisclosed submissions to repositories without explicit AI configuration and disclosed submissions to repositories that have integrated agentic AI tooling. A mixed-methods approach is applied, combining quantitative analysis of acceptance rates and review activity with a qualitative thematic analysis of maintainer feedback.

The results show that acceptance rates differed across contribution types, with repositories that had explicitly integrated agentic AI tooling showing a statistically significantly lower acceptance rate compared to standard repositories with disclosed AI assistance, though not relative to the undisclosed group. Across all groups, staleness accounted for the majority of non-merged pull requests, suggesting that non-engagement was a more common outcome than active rejection. Disclosing AI assistance made no meaningful difference to acceptance rates within the same repository context. No statistically significant differences were found in the volume of reviews or comments across groups, although automated bots contributed a notable share of interactions, particularly in repositories with agentic tooling integration. Thematic analysis of maintainer feedback showed that code quality and implementation correctness were the dominant concerns across all groups, while explicit distrust of AI-generated contributions remained low. When maintainers did reject contributions on AI-related grounds, the concern was typically the degree of human oversight behind the submission rather than AI use itself. Several repositories also introduced or revised AI policies during the contribution period, reflecting how actively norms in this space are still evolving. ...
In recent years, GitHub Actions (GHA) has emerged as the leading platform for Continuous Integration and Continuous Deployment (CI/CD) within the GitHub ecosystem, offering developers seamless workflow automation. However, as with other CI/CD tools, GHA workflows are susceptible to "smells", which are suboptimal practices that can lead to technical debt, reduced maintainability, and performance issues. This thesis investigates the prevalence and nature of these workflow smells in GHA configurations. Through an extensive analysis of commit histories from 83 projects, we identify common patterns of frequent changes in GHA workflows that may indicate the presence of smells. We propose a set of potential GHA-specific smells, develop a tool to automatically detect these smells, and validate our findings through a contribution study involving 40 pull requests to open-source projects. After qualitatively analysing the comments on 32 pull requests, we settle on 7 confirmed GHA workflow smells, including one novel smell previously unrecognised in the literature. This work contributes to improving the quality of GHA workflows and offers insights for developers to optimise their CI/CD processes. Finally, this research was also accepted as a paper at the SCAM 2024 conference. ...

Enhancing Unit Test Understandability: An Evaluation of LLM-Generated Summaries

Bachelor thesis (2024) - N. Djajadi, A.E. Zaidman, A. Deljouyi, A. Katsifodimos
Since software testing is crucial, there has been research on generating test cases automatically. The problem is that the generated test cases can be hard to understand. Multiple factors play a role in understandability, and one of them is test summarization, which provides an overview of what exactly the test is testing and sometimes highlights the key functionalities. Some existing tools generate test summaries using template-based summarization techniques. Limitations of these generated summaries include that they can be lengthy and redundant, and that they work best in combination with well-defined test names and variables. UTGen is a tool that combines EvoSuite and Large Language Models to increase understandability, which includes improving test names and variables, but it does not yet have summarization functionality. In this research, we extend UTGen with LLM-generated summaries. We investigate to what extent the understandability of a test case can be influenced by LLM-generated test summaries in terms of context, conciseness, and naturalness. To this end, we conduct a user evaluation with 11 participants with a software testing background, who judge LLM-generated summaries and compare them to those of existing summarization tools. The LLM-generated summaries scored higher overall than the template-based summaries and were also favored by the participants. ...

Enhancing Automated Software Testing with Runtime Data Integration

Automated software testing plays a critical role in improving software quality and reducing manual testing expenses. However, generating understandable and meaningful unit tests remains challenging, especially with frameworks optimized for coverage like Search-Based Software Testing (SBST). Large Language Models (LLMs) have the capability to generate human-like text, while capture/replay techniques can provide realistic data scenarios through trace logs, contributing to meaningful test case generation. This study introduces UTGen+, an approach that enhances LLM-based SBST by integrating trace logs from end-to-end tests, aiming to further improve test case understandability.
We conducted a comparative user study with 9 participants using UTGen+, original UTGen, and conventional SBST (EvoSuite), focusing on the effects of trace log inclusion on the naturalness and relevancy of comments, identifiers, and test data across several projects. The results indicated that while UTGen+ did not improve the naturalness and relevancy of comments and identifiers, it significantly enhanced the relevancy of test data. These findings suggest that incorporating contextual data can indeed benefit the generation of more relevant and understandable automated test cases. ...

Minimising the Need for Re-prompting in Automatic Understandable Test Generation

Automated test generation is the means to produce correct and usable code while maintaining an efficient and effective development process. UTGen is a tool that utilizes a Large Language Model (LLM) to improve the understandability of a test suite generated by a Search-Based Software Testing tool, namely EvoSuite. Often while the LLM attempts to improve a given test case, it generates code that is too far from the original, changing the test's purpose. Alternatively, it may generate code that does not compile. Such behaviour is called "LLM Hallucination".

The current hallucination handling of UTGen is time-consuming and resource-expensive. To address this, we propose two alternative approaches that use information retrieval prompt engineering techniques to minimise hallucinations. Our respective techniques include incorporating the source code under test and the errors thrown by the latest generated test case to the LLM prompt. We assess our methods through a comparison study against the base UTGen version. We observe that source code retrieval enhances the generation of compilable test cases for complex classes. Error code retrieval shows similar hallucination performance to base UTGen, with a decrease in the number of re-prompts for classes with a high normalised Lack of Cohesion of Methods (*LCOM).

Index Terms - Automated Test Generation, Large Language Models (LLMs), LLM Hallucination, Prompt Engineering ...

A Study on the Ability of Large Language Models to Improve the Understandability of Generated Unit Tests Without Compromising Coverage

Automated software testing is a frequently studied topic in specialized literature. Search-based software testing tools, like EvoSuite, can generate test suites using genetic algorithms without the developer’s input. Large Language Models (LLMs) have recently attracted significant attention in the software engineering domain for their potential to automate test generation. UTGen, a tool integrating LLMs with EvoSuite, produces more understandable tests than EvoSuite; however, the generated tests suffer a coverage drop.

To streamline bug detection by developers, we propose UTGenCov, a concept that focuses on improving the understandability of EvoSuite-generated tests without compromising on coverage. This approach builds upon UTGen by thoroughly analyzing the reasons behind the decrease in coverage and proposing an alternative approach.

Our investigation determined that the leading cause of coverage reduction in UTGen is LLM hallucination in the Understandability phase. UTGenCov aims to address hallucinations by providing the source code of the methods used in the test to the LLM. Yet, our experiment results indicate inconsistent performance and a further decrease in branch coverage of 0.74% compared to UTGen. ...

Using Large Language Models to Assign Readability Scores and Rank Auto-Generated Unit Tests

Bachelor thesis (2024) - I. Zaidi, A. Deljouyi, A.E. Zaidman, A. Katsifodimos
Writing tests enhances quality, yet developers often deprioritize writing tests. Existing tools for automatic test generation face challenges in test understandability. This is primarily due to the fact that these tools fail to consider the context, leading to the generation of identifiers, test names, and identifier data that are not contextually appropriate for the code they are testing. Current metrics for judging the understandability of unit tests are limited as they do not take into account contextual factors such as the quality of comments. Developing a metric to evaluate test readability is essential for selecting the most comprehensible tests. This research builds on UTGen, incorporating LLMs to enhance the readability of automatically generated unit tests. We developed a readability score and used LLMs to evaluate and rank tests, comparing these rankings with human evaluations. This research concludes that LLMs can successfully evaluate the readability of test cases. The GPT-4 Turbo Simple Prompt model exhibited the best performance, with a correlation of 0.7632 with human evaluations. Through comparing different LLMs and techniques for assigning readability scores, we identified approaches that closely matched human evaluations, demonstrating that LLMs can successfully rate the readability of test cases. ...
Master thesis (2024) - R.F. Arntzenius, A.E. Zaidman, P. Pawelczak
Continuous Integration (CI) is a widely used quality-assurance measure within software development. It empowers developers to spot bugs and integration issues early in the development cycle and helps to maintain a coherent codebase, both in terms of quality and styling, even in open-source environments. But CI might have a hidden cost. Projects need to be built and tested continuously throughout the development cycle. It is not uncommon for projects to have multiple commits per day, reaching thousands of commits per year, with each commit having one or multiple build and test cycles. In this thesis, over 200 open-source Java projects were measured with the aim of making developers more aware of how much energy these builds can take and the measures that can be taken to reduce energy consumption where possible. ...
Master thesis (2024) - W. Oosterbroek, A.E. Zaidman, Maurits Elzinga, Robbert Jan Grootjans
In recent years, Low-Code has seen a surge in popularity amongst companies to speed up their workflows. Yet, scientific work on Low-Code is still in its infancy. We set out to investigate the presence of anti-patterns within Low-Code applications. Given the typically less technically inclined nature of Low-Code developers, as well as the specific use cases of Low-Code in general, we expect that these anti-patterns differ from those in traditional programming languages. We apply a graph-based methodology to mine edit patterns across real-world commit data supplied to us by Mendix, one of the leading platforms in the Low-Code space. Additionally, we discuss the lack of current guidelines in the Low-Code field. While we are able to find common edit patterns using our approach, linking them to anti-patterns remains difficult in practice. We do establish that Low-Code in Mendix might lack reusability and that Low-Code development often revolves around a few distinct tasks. However, there is currently a lack of quality data available to properly assess the development practices of Low-Code developers and anti-patterns; increasing the availability of high-quality data is essential for further research in this area. ...
Doctoral thesis (2024) - J. Denkers, A.E. Zaidman, Jurgen J. Vinju
Tools used in software engineering often balance a tradeoff between generality and specificity. The most important tools in software engineering are programming languages, and the most common ones are General-Purpose Languages (GPLs). Because of their generality, GPLs can be used to develop many kinds of software. Domain-Specific Languages (DSLs) are a more specific counterpart; they are programming languages tailored to a specific domain. DSLs are not generally applicable but can be more effective for developing software within their particular domain. DSLs can be beneficial if their benefits outweigh the investments. In practice, it is hard to predict whether a DSL will be beneficial.

Language workbenches are tools for developing and deploying DSLs. They aim to reduce the investment that is required for DSLs and to improve the usability of the created DSLs. By lowering the investment, language workbenches can improve the opportunity for DSLs to be effective. Although much academic work has been published about the underlying technology and concepts of language workbenches, there is little empirical evidence on the actual impact of language workbenches in practice.

In this dissertation, we contribute such empirical evidence on the creation and evaluation of DSLs that are developed with language workbenches. We do so by conducting case studies in an industrial setting. This is important, as such empirical evidence can help others to determine whether to adopt DSLs developed with language workbenches. In particular, we use and evaluate Spoofax, a language workbench developed at the Delft University of Technology.

The context of our work is Canon Production Printing, a digital printing systems manufacturing company. Canon Production Printing provides a good environment for evaluating DSLs as they have obtained extensive domain knowledge for complex domains like modeling behavior, performance, and physical aspects of printing systems. We develop and evaluate DSLs for two of such domains. First, we develop CSX, a new DSL for the domain of configuration spaces of digital printing systems. Second, we reimplement OIL, an existing DSL for control software based on state machines. In both cases, we compare the newly created DSL with the existing situation.

For both cases, we draw generally positive conclusions. For example, in the CSX project, the DSL enables the use of constraint-solving technology which aids automatic and accurate configuration of printing systems, which can ultimately improve the quality, performance, and usability of printing systems. In the OIL project, we found that Spoofax is more than adequate for developing a complex DSL with industrial requirements and we found indications that it is more productive to develop a DSL with Spoofax compared to using a GPL and available libraries.

Our extensive case studies at Canon Production Printing have taught us valuable lessons and insights. In particular, to make good on the promise of DSLs in industry, language workbenches need to improve in terms of the non-functional aspects. We expect that improving on, e.g., portability, usability, and documentation will improve the impact of Spoofax on industrial DSL development.
...

Incorporating Natural Language Understanding for Efficient Program Synthesis

Master thesis (2023) - P.E.F. Klop, A.E. Zaidman, S. Dumančić, Gust Verbruggen
This research introduces a Language Model Augmented Program Synthesis (LMAPS) workflow to enhance traditional Programming by Example (PBE). PBE is a method to automatically generate a program that satisfies a specification that consists of a set of input-output examples. These program specifications are often defined by a few examples, which can lead to multiple programs that satisfy the given examples. In addition, PBE synthesisers have to explore a huge, inefficient search space to solve these problems. The LMAPS workflow incorporates three components to overcome these limitations of PBE by using the language understanding capabilities of Large Language Models (LLMs). LLMs can assist in generating a well-defined specification to mitigate the ambiguity issue inherent in PBE. The core component of LMAPS leverages the capabilities of LLMs to generate programs. These programs can be decomposed into building blocks to create a concise grammar for an inductive program synthesiser. This optimized grammar makes it possible to synthesise correct programs at lower depths, making the workflow more efficient. LLMs can also aid in understanding the automatically generated programs, as these programs can be hard for humans to interpret. We compare LMAPS to a traditional PBE workflow in the task of synthesising regular expressions across four data sets. The results demonstrate that LMAPS can significantly reduce the search space for program synthesis and achieve up to 40% higher accuracy than PBE-only systems. Our research indicates that integrating LLMs into a typical PBE workflow shows significant improvements because of their combined strengths, resulting in a more accurate, efficient, and human-aligned workflow. ...
Software testing plays a crucial role in delivering reliable software. Currently, research is ongoing on how software developers and testers acquire this knowledge of software testing to deliver reliable software and what kind of knowledge is being transferred to developers and testers. In an effort to gain more insight into this area, we will focus on answering which software testing topics are being discussed in dedicated software testing courses and software engineering courses in top-ranked universities. Our findings show us that White-box testing, Black-box testing and the discussion of test levels are the most commonly discussed topics in universities. ...
This paper aims to unveil and gather testing-related information from Stack Overflow, highlighting it as a valuable resource for practitioners seeking answers and guidance.
The study aims to accumulate knowledge from real-life experiences shared on Stack Overflow and bridge the knowledge gap between industry practices and teaching practices.
The paper explores different types of software testing, popular frameworks, temporal trends of testing-related technologies, controversial opinions, and recommended practices/advice/suggestions from Stack Overflow posts. The methodology involves determining search terms through literature, querying the Stack Exchange API, conducting frequency analysis of words from posts, and manually inspecting threads. Our results show that the most popular frameworks discussed are Selenium, Spring, JMeter, and React. Automated testing and JavaScript frameworks have shown an upward trajectory over the years. The recommendations made by practitioners were categorized based on the broad scope of topics covered. We draw comparisons and parallels with related previous research and discuss the technical limitations faced during the study.
Overall, this paper uncovers valuable insights from Stack Overflow and provides practitioners with the current view on industry practices. ...
Software testing is a necessary aspect of software development. With high expectations placed on software testers and a shortage of qualified professionals, Massive Open Online Courses (MOOCs) have emerged as a potential solution to improve software testing education. MOOCs provide accessible education and can offer a comprehensive review of software testing principles and procedures, bridging the gap between formal education and industry expectations. A study of software testing MOOCs was conducted to examine key aspects and compare concepts with university curricula and industry expectations. The findings show that a MOOC on average covers more concepts than a single university course. Additionally, MOOCs align well with what the industry expects from software testing practitioners. Therefore, MOOCs can successfully contribute to software testing education and bridge the gap between university curricula and industry expectations. ...

Mining Software Testing Knowledge

In this study, we try to understand what kind of topics and frameworks are covered by popular software testing books, and see whether these topics satisfy industry needs and address rising trends. To define "popular" software testing books, we formulated three heuristics. The topics of the books are analyzed through LDA topic modelling and manual inspection. LDA results inform us about the dominance of the topics within the whole corpus, while the manual inspection results show how often a topic is addressed. We combine the results of both methods to analyse the most noteworthy topics. We found that test automation, test design and planning, and coverage analysis were the most frequently and extensively discussed topics in our corpus. We conclude that although the books cover some major topics that are demanded by the industry, there are also areas, such as test management and usability testing, that are underrepresented. We also observed that the popular software testing books do not cover the rising software testing trends. While JUnit was the most discussed framework, in general the software testing books do not include practical information for specific frameworks or tools, but rather focus on the tool selection process. ...
As software and systems continue to get more complex, software testing is an important field to ensure that software functions properly. Every day, information about software testing is discussed on the internet via blog posts, discussion boards, and more. This information is scattered among many different websites, making it hard to access. To analyze software testing content published on the internet, newsletters curated by members of the field and reflective of industry trends were used. This analysis provides a broad overview of what software testing-related content is being discussed on the internet. Common problems discussed in newsletters include properly maintaining tests, working with and fixing flaky tests, and properly analyzing test results. JavaScript and TypeScript are the most popular programming languages discussed, while the web is the most popular platform. When looking at test types, automated tests are frequently discussed, followed by end-to-end tests and unit tests. Common techniques and strategies discussed include API testing, the use of continuous integration, and the use of continuous deployment. Selenium, Cypress, and the Gherkin syntax are the most frequently discussed tools and technologies. Finally, opinionated articles tend to be most common, followed by articles that introduce a technology and articles that explain a concept. ...
Master thesis (2023) - A.J.H. Sterk, A.E. Zaidman, Mairieli Wessel, R. Hai, E Hooten
Software development has increasingly become an activity that is (partially) done online on open-source platforms such as GitHub, and with it, so have the tools developers typically use. One such category of tools is that of code coverage tools. These tools track and report coverage data generated during CI tests. As the adoption of these tools has grown, so has the amount of available coverage data. In this thesis we explore a large database of coverage data from Codecov, a popular coverage tool. What sets our work apart from existing research is that it spans a large number of projects which vary in size, language, and domain. Furthermore, we conduct a survey, which was disseminated among a wide variety of open-source developers, instead of at a single company or in an enterprise setting. Our research consists of three parts. Firstly, we assess whether there is a relationship between the time to merge a PR and its coverage levels. We find that such a relationship does exist in certain projects. Secondly, we look at the impact of PR comments mentioning coverage on the odds of said coverage improving. Using the odds ratio test, we conclude that there are greater odds of coverage improving when it is mentioned than when it is not. Thirdly, we conduct a survey to ask developers their reasons for ignoring a failing status check related to code coverage. Some reasons they give are the complexity of testing, the triviality of the proposed changes, or the pull request being too important to wait for proper testing. Furthermore, respondents who identify as code contributors are twice as likely to find fixing coverage a waste of their time as those who identify as code maintainers, while code maintainers are more concerned with not scaring away new contributors with strict coverage guidelines. ...
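
The odds ratio used in the second part of the abstract above can be illustrated with a 2x2 table of pull requests, split by whether coverage was mentioned in a comment and whether coverage subsequently improved. The sketch below shows only the textbook calculation; the function name, parameter names, and counts are hypothetical and are not data from the thesis.

```python
def odds_ratio(improved_mentioned, not_improved_mentioned,
               improved_unmentioned, not_improved_unmentioned):
    """Odds of coverage improving when mentioned, relative to when not mentioned."""
    if 0 in (not_improved_mentioned, improved_unmentioned, not_improved_unmentioned):
        raise ValueError("odds ratio is undefined when these cells are zero")
    odds_mentioned = improved_mentioned / not_improved_mentioned
    odds_unmentioned = improved_unmentioned / not_improved_unmentioned
    return odds_mentioned / odds_unmentioned


# Hypothetical counts: 30 of 40 PRs improved when coverage was mentioned,
# 20 of 40 improved when it was not.
example = odds_ratio(30, 10, 20, 20)
```

A value above 1 corresponds to the finding reported above, namely greater odds of coverage improving when it is mentioned; a full analysis would also report a confidence interval or significance test, which this sketch omits.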