A.E. Zaidman
120 records found
On the emergence of testing strategies
A socio-technical grounded theory
The qualitative factor in software testing
A systematic mapping study of qualitative methods
OIL
An industrial case study in language engineering with Spoofax
Not One to Rule Them All
Mining Meaningful Code Review Orders From GitHub
Developers use tools such as GitHub pull requests to review code, discuss proposed changes, and request modifications. While changed files are commonly presented in alphabetical order, this does not necessarily coincide with the reviewer's preferred navigation sequence. This study investigates the different navigation orders developers follow while commenting on changes submitted in pull requests. We mined code review comments from 23,241 pull requests in 100 popular Java and Python repositories on GitHub to analyze the order in which the reviewers commented on the submitted changes. Our analysis shows that for 44.6% of pull requests, the reviewers comment in a non-alphabetical order. Among these pull requests, we identified traces of alternative meaningful orders: 20.6% (2,134) followed a largest-diff first order, 17.6% (1,827) were commented in the order of the files' similarity to the pull request's title and description, and 29% (1,188) of pull requests containing changes to both production and test files adhered to a test-first order. We also observed that the proportion of reviewed files to total submitted files was significantly higher in non-alphabetically ordered reviews, which also received slightly fewer approvals from reviewers, on average. Our findings highlight the need for additional support during code reviews, particularly for larger pull requests, where reviewers are more likely to adopt complex strategies rather than following a single predefined order.
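The order categories named in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the study's exact matching rules, so the function names, the strict ordering checks, and the `is_test` predicate are all assumptions made for illustration. The input is taken to be the list of distinct files in the order a reviewer first commented on them.

```python
from typing import Callable, Dict, List


def is_alphabetical(commented_files: List[str]) -> bool:
    """True if the files were commented on in plain lexicographic path order."""
    return commented_files == sorted(commented_files)


def is_largest_diff_first(commented_files: List[str],
                          diff_sizes: Dict[str, int]) -> bool:
    """True if each commented file has a diff at least as large as the next one.

    diff_sizes maps a file path to its number of changed lines (hypothetical input).
    """
    sizes = [diff_sizes[path] for path in commented_files]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))


def is_test_first(commented_files: List[str],
                  is_test: Callable[[str], bool]) -> bool:
    """True if every commented test file precedes every commented production file.

    Only meaningful when both kinds of file are present, so any other
    input yields False.
    """
    flags = [is_test(path) for path in commented_files]
    if not any(flags) or all(flags):
        return False
    first_production = flags.index(False)
    # No test file may appear once the first production file has been seen.
    return not any(flags[first_production:])
```

A real analysis would likely need a tolerance for near-matches and a definition of alphabetical order that matches how GitHub sorts paths; this sketch only checks exact conformance.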
Mind the Gap
What Working With Developers on Fuzz Tests Taught Us About Coverage Gaps
Can fuzzers generate partial tests that developers find useful enough to complete into functional tests (e.g., by adding assertions)? To address this question, we develop a prototype within the Mozilla ecosystem and open 13 bug reports proposing partial generated tests for currently uncovered code. We found that the majority of the reactions focus on whether the targeted coverage gap is actually worth testing. To investigate further which coverage gaps developers find relevant to close, we design an automated filter to exclude irrelevant coverage gaps before generating tests. From conversations with 13 developers about whether the remaining coverage gaps are worth closing when a partially generated test is available, we learn that the filtering indeed removes clearly non-test-worthy gaps. The developers propose a variety of additional strategies to address the coverage gaps and how to make fuzz tests and reports more useful for developers.
Running a Red Light
An Investigation into Why Software Engineers (Occasionally) Ignore Coverage Checks
Many modern code coverage tools track and report code coverage data generated from running tests during continuous integration. They report code coverage data through a variety of channels, including email, Slack, Mattermost, or through the web interface of social coding platforms such as GitHub. In fact, this ensemble of tools can be configured in such a way that the software engineer gets a failing status check when code coverage drops below a certain threshold. In this study, we broadly investigate the opinions and experience with code coverage tools through a survey among 279 software engineers whose projects use the Codecov coverage tool and bot. In particular, we investigate why software engineers would ignore a failing status check caused by a drop in code coverage. We observe that more than 80% of software engineers ignore these failing status checks at least sometimes, and we gain insight into the main reasons why they do so.
To ensure the quality of software systems, software engineers can make use of a variety of quality assurance approaches, for example, software testing, modern code review, automated static analysis, and build automation. Each of these quality assurance practices has been studied in depth in isolation, but there is a clear knowledge gap when it comes to our understanding of how these approaches are being used in conjunction, or not. In our study, we broadly investigate whether and how these quality assurance approaches are being used in conjunction in the development of 1454 popular open source software projects on GitHub. Our study indicates that typically projects do not follow all quality assurance practices together with high intensity. In fact, we observe only weak correlation among some quality assurance practices. In general, our study provides a deeper understanding of how existing quality assurance approaches are currently being used in Java-based open source software development. Besides, we specifically zoom in on the more mature projects in our dataset, and generally we observe that more mature projects are more intense in their application of the quality assurance practices, with more focus on their ASAT usage, and code reviewing, but no strong change in their CI usage.
Software testing is a necessary aspect of software development. With high expectations placed on software testers and a shortage of qualified professionals, Massive Open Online Courses (MOOCs) have emerged as a potential solution to improve software testing education. MOOCs provide accessible education, bridging the gap between formal education and industry expectations. We investigate key aspects of and compare concepts of software testing MOOCs with university curricula and industry expectations. The findings show that a MOOC on average covers more concepts than a single university course. Additionally, MOOCs align well with what the industry expects from software testing practitioners.
Using GitHub Copilot for Test Generation in Python
An Empirical Study
Writing unit tests is a crucial task in software development, but it is also recognized as a time-consuming and tedious task. As such, numerous test generation approaches have been proposed and investigated. However, most of these test generation tools produce tests that are typically difficult to understand. Recently, Large Language Models (LLMs) have shown promising results in generating source code and supporting software engineering tasks. As such, we investigate the usability of tests generated by GitHub Copilot, a proprietary closed-source code generation tool that uses an LLM. We evaluate GitHub Copilot's test generation abilities both within and without an existing test suite, and we study the impact of different code commenting strategies on test generations. Our investigation evaluates the usability of 290 tests generated by GitHub Copilot for 53 sampled tests from open source projects. Our findings highlight that within an existing test suite, approximately 45.28% of the tests generated by Copilot are passing tests; 54.72% of generated tests are failing, broken, or empty tests. Furthermore, if we generate tests using Copilot without an existing test suite in place, we observe that 92.45% of the tests are failing, broken, or empty tests. Additionally, we study how test method comments influence the usability of test generations.
Scoping Software Engineering for AI
The TSE Perspective
As we have come to rely on software systems in our daily lives, we have a clear expectation about the reliability of these systems. To ensure this reliability, automated software quality assurance processes have become an important part of software development. However, given the climate crisis that we are witnessing, it is important to ask ourselves what the impact of all these automated quality assurance processes is in terms of electricity consumption. This study explores the electricity consumption and potential environmental impact of continuous integration and software testing in 10 open source software projects.
Shaken, Not Stirred
How Developers Like Their Amplified Tests
Test amplification makes systematic changes to existing, manually written tests to provide tests complementary to an automated test suite. We consider developer-centric test amplification, where the developer explores, judges and edits the amplified tests before adding them to their maintained test suite. However, it is as yet unclear which kind of selection and editing steps developers take before including an amplified test into the test suite. In this paper we conduct an open source contribution study, amplifying tests of open source Java projects from GitHub. We report which deficiencies we observe in the amplified tests while manually filtering and editing them to open 39 pull requests with amplified tests. We present a detailed analysis of the maintainers' feedback regarding proposed changes, requested information, and expressed judgment. Our observations provide a basis for practitioners to make an informed decision on whether to adopt developer-centric test amplification. As several of the edits we observe are based on the developer's understanding of the amplified test, we conjecture that developer-centric test amplification should invest in supporting the developer to understand the amplified tests.
Test case prioritization techniques have emerged as effective strategies to optimize this process and mitigate the regression testing costs. Commonly, black-box heuristics guide optimal test ordering, leveraging information retrieval (e.g., cosine distance) to measure the distance between test cases and sort them accordingly. However, a challenge arises when dealing with tests of varying granularity levels, as they may employ distinct vocabularies (e.g., name identifiers). In this paper, we propose to measure the distance between test cases based on the shortest path between their identifiers within the WordNet lexical database. This additional heuristic is combined with the traditional cosine distance to prioritize test cases in a multi-objective fashion. Our preliminary study conducted with two different Java projects shows that test cases prioritized with WordNet achieve larger fault detection capability (APFDc) compared to the traditional cosine distance used in the literature.
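The cosine-distance baseline mentioned in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: the tokenizer, the greedy farthest-first ordering, and the names `tokens`, `cosine_distance`, and `prioritize` are all hypothetical, and the WordNet shortest-path heuristic and the multi-objective combination are deliberately left out.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokens(test_source: str) -> Counter:
    """Bag of lower-cased words, splitting identifiers at camelCase boundaries."""
    words = re.findall(r"[A-Za-z][a-z]*", test_source)
    return Counter(word.lower() for word in words)


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 minus the cosine similarity of two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def prioritize(tests: Dict[str, str]) -> List[str]:
    """Greedy ordering: repeatedly pick the test farthest from those already chosen.

    tests maps a test name to its source text. Starting from the
    alphabetically first test is an arbitrary choice for determinism.
    """
    vectors = {name: tokens(source) for name, source in tests.items()}
    remaining = sorted(vectors)
    if not remaining:
        return []
    order = [remaining.pop(0)]
    while remaining:
        best = max(
            remaining,
            key=lambda name: min(cosine_distance(vectors[name], vectors[chosen])
                                 for chosen in order),
        )
        order.append(best)
        remaining.remove(best)
    return order
```

The idea behind such diversity-based orderings is that textually dissimilar tests are likely to exercise different behaviour, so running them early may expose faults sooner. Two tests that use different words for the same concept look maximally distant under this measure, which is the vocabulary problem the abstract raises.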
Sentiment overflow in the testing stack
Analyzing software testing posts on Stack Overflow
Software testing is an integral part of modern software engineering practice. Past research has not only underlined its significance, but also revealed its multi-faceted nature. The practice of software testing and its adoption is influenced by many factors that go beyond tools or technology. This paper sets out to investigate the context of software testing from the practitioners’ point of view by mining and analyzing sentimental posts on the widely used question and answer website Stack Overflow. By qualitatively analyzing sentimental expressions of practitioners, which we extract from the Stack Overflow dataset using sentiment analysis tools, we discern factors that help us to better understand the lived experience of software engineers with regard to software testing. Grounded in the data that we have analyzed, we argue that sentiments like insecurity, despair, and aspiration have an impact on practitioners’ attitude towards testing. We suggest that they are connected to concrete factors like the level of complexity of projects in which software testing is practiced. Editor's note: Open Science material was validated by the Journal of Systems and Software Open Science Board.
Software testing is generally acknowledged to be an important weapon in the arsenal of software engineers to produce correct and reliable software systems. However, given the importance of the topic, little is known about where software engineers get their testing knowledge and skills from. Is this through (higher) education, training programmes in the industry, or rather is it self-taught? In this paper, we investigate the curricula of 100 highly ranked universities and survey 51 software engineers to shed light on the state-of-the-practice in software testing education, in terms of both academic education and education of software engineers in the industry.
Search-based approaches have been used in the literature to automate the process of creating unit test cases. However, related work has shown that generated tests with high code coverage could be ineffective, i.e., they may not detect all faults or kill all injected mutants. In this paper, we propose Cling, an integration-level test case generation approach that exploits how a pair of classes, the caller and the callee, interact with each other through method calls. In particular, Cling generates integration-level test cases that maximize the Coupled Branches Criterion (CBC). Coupled branches are pairs of branches containing a branch of the caller and a branch of the callee such that an integration test that exercises the former also exercises the latter. CBC is a novel integration-level coverage criterion, measuring the degree to which a test suite exercises the interactions between a caller and its callee classes. We implemented Cling and evaluated the approach on 140 pairs of classes from five different open-source Java projects. Our results show that (1) Cling generates test suites with high CBC coverage, thanks to the definition of the test suite generation as a many-objective problem where each pair of coupled branches is an independent objective; (2) such generated suites trigger different class interactions and can kill on average 7.7% (with a maximum of 50%) of mutants that are not detected by tests generated randomly or at the unit level; (3) Cling can detect integration faults coming from wrong assumptions about the usage of the callee class (25 for our subject systems) that remain undetected when using automatically generated random and unit-level test suites.
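One plausible reading of how a coupled-branch coverage score could be computed is sketched below in Python. The abstract does not state Cling's actual measurement procedure, so the function name `cbc_coverage`, the branch identifiers, the trace representation, and the handling of an empty pair set are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

# A coupled branch pair: (caller branch id, callee branch id). Ids are hypothetical.
BranchPair = Tuple[str, str]


def cbc_coverage(coupled_branches: Set[BranchPair],
                 test_traces: Iterable[Set[str]]) -> float:
    """Fraction of coupled branch pairs exercised together by at least one test.

    Each trace is the set of branch ids a single test executed. A pair counts
    as covered only if one test executes both its caller and callee branch.
    Returning 0.0 for an empty pair set is an arbitrary convention.
    """
    traces = list(test_traces)
    if not coupled_branches:
        return 0.0
    covered = sum(
        1 for caller_branch, callee_branch in coupled_branches
        if any(caller_branch in trace and callee_branch in trace for trace in traces)
    )
    return covered / len(coupled_branches)
```

Treating each pair as its own objective, as the abstract describes, corresponds here to each pair contributing independently to the covered count.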
Projects on GitHub rely on the automation provided by software development bots. Nevertheless, the presence of bots can be annoying and disruptive to the community. Backed by multiple studies with practitioners, this article provides guidelines for developing and maintaining software bots.