X.D.M. Devroey | TU Delft Repository

Basic block coverage for search-based unit testing and crash reproduction

Journal article (2022) - Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman

Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless-method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based unit test generation (using the DynaMOSA algorithm) and search-based crash reproduction (using the STDistance and WeightedSum fitness functions). Our results show that for unit test generation, BBC improves the branch coverage of the generated tests. Although small (∼ 1.5%), this improvement in the branch coverage is systematic and leads to an increase of the output domain coverage and implicit runtime exception coverage, and of the diversity of runtime states. In terms of crash reproduction, in the combination of STDistance and WeightedSum, BBC helps in reproducing 3 new crashes for each fitness function. BBC significantly decreases the time required to reproduce 43.5% and 45.1% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 71.7% (for STDistance) and 68.7% (for WeightedSum) on average. ...

JUGE: An Infrastructure for Benchmarking Java Unit Test Generators

Journal article (2022) - Xavier Devroey, Alessio Gambi, Juan Pablo Galeotti, René Just, Fitsum Meshesha Kifetew, Sebastiano Panichella, A. Panichella

Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and various platforms (e.g., desktop, web, or mobile applications). The generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries versus system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved by systematically executing large-scale evaluations of different generators. However, executing such empirical evaluations is not trivial and requires substantial effort to select appropriate benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this Software Note, we present our JUnit Generation Benchmarking Infrastructure (JUGE) supporting generators (search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall benchmarking effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, several editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place where JUGE was used and evolved. As a result, an increasing amount of tools (over 10) from academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions. Based on the experience gained from the competitions, we discuss the expected impact of JUGE in improving the knowledge transfer on tools and approaches for test generation between academia and industry. Indeed, the JUGE infrastructure demonstrated an implementation design that is flexible enough to enable the integration of additional unit test generation tools, which is practical for developers and allows researchers to experiment with new and advanced unit testing tools and approaches. ...

Researchers and practitioners have designed and implemented various automated test case generators to support effective software testing. Such generators exist for various languages (e.g., Java, C#, or Python) and various platforms (e.g., desktop, web, or mobile applications). The generators exhibit varying effectiveness and efficiency, depending on the testing goals they aim to satisfy (e.g., unit-testing of libraries versus system-testing of entire applications) and the underlying techniques they implement. In this context, practitioners need to be able to compare different generators to identify the most suited one for their requirements, while researchers seek to identify future research directions. This can be achieved by systematically executing large-scale evaluations of different generators. However, executing such empirical evaluations is not trivial and requires substantial effort to select appropriate benchmarks, setup the evaluation infrastructure, and collect and analyse the results. In this Software Note, we present our JUnit Generation Benchmarking Infrastructure (JUGE) supporting generators (search-based, random-based, symbolic execution, etc.) seeking to automate the production of unit tests for various purposes (validation, regression testing, fault localization, etc.). The primary goal is to reduce the overall benchmarking effort, ease the comparison of several generators, and enhance the knowledge transfer between academia and industry by standardizing the evaluation and comparison process. Since 2013, several editions of a unit testing tool competition, co-located with the Search-Based Software Testing Workshop, have taken place where JUGE was used and evolved. As a result, an increasing amount of tools (over 10) from academia and industry have been evaluated on JUGE, matured over the years, and allowed the identification of future research directions. Based on the experience gained from the competitions, we discuss the expected impact of JUGE in improving the knowledge transfer on tools and approaches for test generation between academia and industry. Indeed, the JUGE infrastructure demonstrated an implementation design that is flexible enough to enable the integration of additional unit test generation tools, which is practical for developers and allows researchers to experiment with new and advanced unit testing tools and approaches.

Basic Block Coverage for Unit Test Generation at the SBST 2022 Tool Competition

Conference paper (2022) - Pouria Derakhshanfar, Xavier Devroey

Basic Block Coverage (BBC) is a secondary objective for search-based unit test generation techniques relying on the approach level and branch distance to drive the search process. Unlike the approach level and branch distance, which considers only information related to the coverage of explicit branches coming from conditional and loop statements, BBC also takes into account implicit branchings (e.g., a runtime exception thrown in a branchless method) denoted by the coverage level of relevant basic blocks in a control flow graph to drive the search process. Our implementation of BBC for unit test generation relies on the DynaMOSA algorithm and EvoSuite. This paper summarizes the results achieved by EvoSuite's DynaMOSA implementation with BBC as a secondary objective at the SBST 2022 unit testing tool competition. ...

VaryMinions: Leveraging RNNs to Identify Variants in Event Logs

Conference paper (2021) - Sophie Fortz, Paul Temple, Xavier DEVROEY, Patrick HEYMANS, GILLES PERROUIN

Business processes have to manage variability in their execution, e.g., to deliver the correct building permit in different municipalities. This variability is visible in event logs, where sequences of events are shared by the core process (building permit authorisation) but may also be specific to each municipality. To rationalise resources (e.g., derive a configurable business process capturing all municipalities' permit variants) or to debug anomalous behaviour, it is mandatory to identify to which variant a given trace belongs. This paper supports this task by training Long Short Term Memory (LSTMs) and Gated Recurrent Units (GRUs) algorithms on two datasets: a configurable municipality and a travel expenses workflow. We demonstrate that variability can be identified accurately (>87%) and discuss the challenges of learning highly entangled variants. ...

Summary of Search-based Crash Reproduction using Behavioral Model Seeding

Conference paper (2021) - P. Derakhshanfar, Xavier Devroey, Gilles Perrouin, A.E. Zaidman, A. van Deursen

This is an extended abstract of the article: Pouria Derakhshanfar, Xavier Devroey, Gilles Perrouin, Andy Zaidman and Arie van Deursen. 2019. Search-based crash reproduction using behavioural model seeding. In: Software Testing, Verification and Reliability (May 2020). http://doi.org/10.1002/stvr.1733. ...

Experience Report on Soft and Project Skills Building through Repetition

Conference paper (2021) - Xavier Devroey, Moussa Amrani, Benoît Vanderose

Acquiring soft and project skills during their studies is of paramount importance for computer science students to integrate large development teams after graduating. Project-oriented learning offers interesting opportunities for teachers to tutor students, and allows them to acquire and train those skills in addition to the core topics of the course. However, since most existing curricula require courses to be as independent as possible (for organizational reasons for instance), some topics are covered in different courses in slightly different ways. This repetition is interesting for understanding difficult notions appropriately, but may also hamper students' understanding when closely related concepts are embedded in different ways. We report here on our teaching approach: we propose a series of projects that share a common theme, in order to (i) provide a transversal understanding of common notions seen in separate courses, and (ii) introduce soft and project skills in a progressive way, enabling students to iteratively experience and learn skills that are necessary for professional life. We report on the results of interviews conducted with the students and extract valuable lessons for reproducing this approach in different curricula. ...

Crash reproduction difficulty, an initial assessment

Journal article (2020) - Boris Cherry, Xavier Devroey, Pouria Derakhshanfar, Benoît Vanderose

This study presents the initial step towards a thorough analysis of the difficulty to reproduce a crash using searchbased crash reproduction. Traditionally, code size and complexity are considered representative indicators of the difficulty for search-based approaches, like search-based unit test generation, to generate tests. However, unlike unit test generation, crash reproduction does not seek to cover a set of behaviors but instead to generate one or more tests exercising a specific behavior reproducing a given crash. In this context, there is no guarantee that the indicators used for unit testing are still valid for crash reproduction. In this study, we seek to identify such indicators by considering various code metrics, code smells, and change metrics. We report our effort to collect those metrics for JCRASHPACK, a state-of-the-art crash reproduction benchmark, and an initial assessment by considering metrics individually. Our results show that although JCRASHPACK is larger than benchmarks used in previous studies, additional crashes should be added to improve its diversity and representativeness, and that no individual metric can be used to characterize the difficulty to reproduce a crash. ...

Commonality-Driven Unit Test Generation

Conference paper (2020) - B. Evers, Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman

Various search-based test generation techniques have been proposed to automate the generation of unit tests fulfilling different criteria (e.g., line coverage, branch coverage, mutation score, etc.). Despite several advances made over the years, search-based unit test generation still suffers from a lack of guidance due to the limited amount of information available in the source code that, for instance, hampers the generation of complex objects. Previous studies introduced many strategies to address this issue, e.g., dynamic symbolic execution or seeding, but do not take the internal execution of the methods into account. In this paper, we introduce a novel secondary objective called commonality score, measuring how close the execution path of a test case is from reproducing a common or uncommon execution pattern observed during the operation of the software. To assess the commonality score, we implemented it in EvoSuite and evaluated its application on 150 classes from JabRef, an open-source software for managing bibliography references. Our results are mixed. Our approach leads to test cases that indeed follow common or uncommon execution patterns. However, if the commonality score can have a positive impact on the structural coverage and mutation score of the generated test suites, it can also be detrimental in some cases. ...

It is not Only About Control Dependent Nodes: Basic Block Coverage for Search-Based Crash Reproduction

Conference paper (2020) - Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman

Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless-method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based crash reproduction because the implicit branches commonly occur when trying to reproduce a crash, and the search process needs to cover only a few basic blocks (i.e., blocks that are executed before crash happening). We combined BBC with existing fitness functions (namely STDistance and WeightedSum) and ran our evaluation on 124 hard-to-reproduce crashes. Our results show that BBC, in combination with STDistance and WeightedSum, reproduces 6 and 1 new crashes, respectively. BBC significantly decreases the time required to reproduce 26.6% and 13.7% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 44.3% (for STDistance) and 40.6% (for WeightedSum) on average. ...

An Application of Model Seeding to Search-based Unit Test Generation for Gson

Conference paper (2020) - Mitchell Olsthoorn, Pouria Derakhshanfar, Xavier Devroey

Model seeding is a strategy for injecting additional information in a search-based test generation process in the form of models, representing usages of the classes of the software under test. These models are used during the search-process to generate logical sequences of calls whenever an instance of a specific class is required. Model seeding was originally proposed for search-based crash reproduction. We adapted it to unit test generation using EvoSuite and applied it to GSON, a Java library to convert Java objects from and to JSON. Although our study shows mixed results, it identifies potential future research directions. ...

Extended abstract: Test them all, is it worth it? Assessing configuration sampling on the JHipster Web development stack

Conference paper (2020) - Axel Halin, Alexandre Nuttinck, Mathieu Acher, Xavier Devroey, Gilles Perrouin, Benoit Baudry

This is an extended abstract of the article: Axel Halin, Alexandre Nuttinck, Mathieu Acher, Xavier Devroey, Gilles Perrouin, and Benoit Baudry. 2018. Test them all, is it worth it? Assessing config- uration sampling on the JHipster Web development stack. In Em- pirical Software Engineering (17 Jul 2018). https://doi.org/10.1007/ s10664-018-9635-4. ...

MALTESQUE 2019 Workshop Summary

Journal article (2020) - Francesca Arcelli Fontana, Gilles Perrouin, Apostolos Ampatzoglou, Mathieu Archer, Bartosz Walter, Maxime Cordy, Fabio Palomba, Xavier Devroey

Report from the 1st Int. Workshop on Education through Advanced Software Engineering and Artificial Intelligence (EASEAI ’19)

Journal article (2020) - Benoît Vanderose, Benoît Frenay, Julie Henry, Xavier Devroey

In the past years, everyday life has been profoundly transformed by the development and widespread of digital technologies. General, as well as specialized audiences, have to face an ever-increasing amount of knowledge and learn new abilities. This first edition of the EASEAI workshop tried to address that challenge by looking at software engineering, education, and artificial intelligence research fields to explore how they can be combined. Specifically, we brought together researchers, teachers, and practitioners who use advanced software engineering tools and artificial intelligence techniques in education. And researchers and teachers in education science who address the problem of improving the awareness regarding digital technologies through a transgenerational and transdisciplinary range of students. ...

A benchmark-based evaluation of search-based crash reproduction

Journal article (2020) - Mozhan Soltani, Pouria Derakhshanfar, Xavier Devroey, Arie van Deursen

Crash reproduction approaches help developers during debugging by generating a test case that reproduces a given crash. Several solutions have been proposed to automate this task. However, the proposed solutions have been evaluated on a limited number of projects, making comparison difficult. In this paper, we enhance this line of research by proposing JCrashPack, an extensible benchmark for Java crash reproduction, together with ExRunner, a tool to simply and systematically run evaluations. JCrashPack contains 200 stack traces from various Java projects, including industrial open source ones, on which we run an extensive evaluation of EvoCrash, the state-of-the-art tool for search-based crash reproduction. EvoCrash successfully reproduced 43% of the crashes. Furthermore, we observed that reproducing NullPointerException, IllegalArgumentException, and IllegalStateException is relatively easier than reproducing ClassCastException, ArrayIndexOutOfBoundsException and StringIndexOutOfBoundsException. Our results include a detailed manual analysis of EvoCrash outputs, from which we derive 14 current challenges for crash reproduction, among which the generation of input data and the handling of abstract and anonymous classes are the most frequents. Finally, based on those challenges, we discuss future research directions for search-based crash reproduction for Java. ...

Botsing, a Search-based Crash Reproduction Framework for Java

Conference paper (2020) - Pouria Derakhshanfar, Xavier Devroey, Annibale Panichella, Andy Zaidman, Arie van Deursen

Approaches for automatic crash reproduction aim to generate test cases that reproduce crashes starting from the crash stack traces. These tests help developers during their debugging practices. One of the most promising techniques in this research field leverages search-based software testing techniques for generating crash reproducing test cases. In this paper, we introduce Botsing, an open-source search-based crash reproduction framework for Java. Botsing implements state-of-the-art and novel approaches for crash reproduction. The well-documented architecture of Botsing makes it an easy-to-extend framework, and can hence be used for implementing new approaches to improve crash reproduction. We have applied Botsing to a wide range of crashes collected from open source systems. Furthermore, we conducted a qualitative assessment of the crash-reproducing test cases with our industrial partners. In both cases, Botsing could reproduce a notable amount of the given stack traces.
Demo. video: https://www.youtube.com/watch?v=k6XaQjHqe48
Botsing website: https://stamp-project.github.io/botsing/ ...

Welcome from the Chairs

Foreword postscript (2020) - Benoît Vanderose, Julie Henry, Benoît Frenay, Xavier Devroey

Welcome to the second edition of the International Workshop on Education
through Advanced Software Engineering and Artificial Intelligence (EASEAI 2020)
to be held virtually, November 9, 2020, co-located with ESEC/FSE 2020.
In the past years, with the development and widespread of digital technologies,
everyday life has been profoundly transformed. The general public, as well as
specialized audiences, have to face an ever-increasing amount of knowledge and
learn new abilities. The EASEAI workshop addresses that challenge by looking
at software engineering, education, and artificial intelligence research fields to
explore how they can be combined. Specifically, this workshop brings together
researchers, teachers, and practitioners who use advanced software engineering
tools and artificial intelligence techniques in the education field and through a
transgenerational and transdisciplinary range of students to discuss the current
state of the art and practices, and establish new future directions. In total, EASEAI 2020 received 13 submissions, out of which 5 papers were accepted after a thorough reviewprocess. Three members of the program committee reviewed each submission. We also received two presentation abstracts, both selected for presentation during the workshop. We sincerely thank the program committee members, authors, and participants who will make EASEAI an exciting and successful event! ...

Java Unit Testing Tool Competition - Eighth Round

Conference paper (2020) - Xavier Devroey, Sebastiano Panichella, Alessio Gambi

We report on the results of the eighth edition of the Java unit testing tool competition. This year, two tools, EvoSuite and Randoop, were executed on a benchmark with (i) new classes under test, selected from open-source software projects, and (ii) the set of classes from one project considered in the previous edition. We relied on an updated infrastructure for the execution of the different tools and the subsequent coverage and mutation analysis based on Docker containers. We considered two different time budgets for test case generation: one an three minutes. This paper describes our method- ology and statistical analysis of the results, presents the results achieved by the contestant tools and highlights the challenges we faced during the competition. ...

Pandemic programming: How COVID-19 affects software developers and how their organizations can help

Journal article (2020) - Paul Ralph, Sebastian Baltes, Minghui Zhou, Burak Turhan, Rashina Hoda, Hideaki Hata, Gregorio Robles, Amin Milani Fard, Rana Alkadhi, Gianisa Adisaputri, Richard Torkar, Vladimir Kovalenko, Marcos Kalinowski, Nicole Novielli, Shin Yoo, Xavier Devroey, Xin Tan

Context: As a novel coronavirus swept the world in early 2020, thousands of software developers began working from home. Many did so on short notice, under difficult and stressful conditions.

Objective: This study investigates the effects of the pandemic on developers’ wellbeing and productivity.

Method: A questionnaire survey was created mainly from existing, validated scales and translated into 12 languages. The data was analyzed using non-parametric inferential statistics and structural equation modeling.

Results: The questionnaire received 2225 usable responses from 53 countries. Factor analysis supported the validity of the scales and the structural model achieved a good fit (CFI = 0.961, RMSEA = 0.051, SRMR = 0.067). Confirmatory results include: (1) the pandemic has had a negative effect on developers’ wellbeing and productivity; (2) productivity and wellbeing are closely related; (3) disaster preparedness, fear related to the pandemic and home office ergonomics all affect wellbeing or productivity. Exploratory analysis suggests that: (1) women, parents and people with disabilities may be disproportionately affected; (2) different people need different kinds of support.

Conclusions: To improve employee productivity, software companies should focus on maximizing employee wellbeing and improving the ergonomics of employees’ home offices. Women, parents and disabled persons may require extra support. ...

Crash Reproduction Using Helper Objectives

Conference paper (2020) - Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman, Arie van Deursen, Annibale Panichella

Evolutionary-based crash reproduction techniques aid developers in their debugging practices by generating a test case that reproduces a crash given its stack trace. In these techniques, the search process is typically guided by a single search objective called Crash Distance. Previous studies have shown that current approaches could only reproduce a limited number of crashes due to a lack of diversity in the population during the search. In this study, we address this issue by applying Multi-Objectivization using Helper-Objectives (MO-HO) on crash reproduction. In particular, we add two helper-objectives to the Crash Distance to improve the diversity of the generated test cases and consequently enhance the guidance of the search process. We assessed MO-HO against the single-objective crash reproduction. Our results show that MO-HO can reproduce two additional crashes that were not previously reproducible by the single-objective approach. ...

Good Things Come In Threes

Improving Search-based Crash Reproduction With Helper Objectives

Conference paper (2020) - Pouria Derakhshanfar, Xavier Devroey, Andy Zaidman, Arie van Deursen, Annibale Panichella

Writing a test case reproducing a reported software crash is a common practice to identify the root cause of an anomaly in the software under test. However, this task is usually labor-intensive and time-taking. Hence, evolutionary intelligence approaches have been successfully applied to assist developers during debugging by generating a test case reproducing reported crashes. These approaches use a single fitness function called Crash Distance to guide the search process toward reproducing a target crash. Despite the reported achievements, these approaches do not always successfully reproduce some crashes due to a lack of test diversity (premature convergence). In this study, we introduce a new approach, called MO-HO, that addresses this issue via multi-objectivization. In particular, we introduce two new Helper-Objectives for crash reproduction, namely test length (to minimize) and method sequence diversity (to maximize), in addition to Crash Distance. We assessed MOHO using five multi-objective evolutionary algorithms (NSGA-II, SPEA2, PESA-II, MOEA/D, FEMO) on 124 non-trivial crashes stemming from open-source projects. Our results indicate that SPEA2 is the best-performing multi-objective algorithm for MO-HO. We evaluated this best-performing algorithm for MO-HO against the state-of-the-art: single-objective approach (Single-Objective Search) and decomposition-based multi-objectivization approach (De-MO). Our results show that MO-HO reproduces five crashes that cannot be reproduced by the current state-of-the-art. Besides, MO-HO improves the effectiveness (+10% and +8% in reproduction ratio) and the efficiency in 34.6% and 36% of crashes (i.e., significantly lower running time) compared to Single-Objective Search and De-MO, respectively. For some crashes, the improvements are very large, being up to +93.3% for reproduction ratio and -92% for the required running time. ...