AP

A. Panichella

info

Please Note

109 records found

Multi-objective Testing of Deep Reinforcement Learning Agents

Conference paper (2026) - Antony Bartlett, Cynthia Liem, Annibale Panichella
Testing deep reinforcement learning (DRL) agents in safety-critical domains requires discovering diverse failure scenarios. Existing tools such as INDAGO rely on single-objective optimization focused solely on maximizing failure counts, but this does not ensure discovered scenarios are diverse or reveal distinct error types. We introduce INDAGO-Nexus, a multi-objective search approach that jointly optimizes for failure likelihood and test scenario diversity using multi-objective evolutionary algorithms with multiple diversity metrics and Pareto front selection strategies. We evaluated INDAGO-Nexus on three DRL agents: humanoid walker, self-driving car, and parking agent. On average, INDAGO-Nexus discovers up to 83% and 40% more unique failures (test effectiveness) than INDAGO in the SDC and Parking scenarios, respectively, while reducing time-to-failure by up to 67% across all agents. ...
Conference paper (2025) - A. Panichella
Knowledge distillation compresses large language models (LLMs) into more compact and efficient versions that achieve similar accuracy on code-related tasks. However, as we demonstrate in this study, compressed models are four times less robust than the original LLMs when evaluated with metamorphic code. They exhibit a 440% higher probability of misclassifying code clones due to minor changes in the code fragment under analysis, such as replacing parameter names with synonyms. To address this issue, we propose Morph, a novel method that combines metamorphic testing with many-objective optimization for a robust distillation of LLMs for code. Morph efficiently explores the models' configuration space and generates Paretooptimal models that effectively balance accuracy, efficiency, and robustness to metamorphic code. Metamorphic testing measures robustness as the number of code fragments for which a model incorrectly makes different predictions between the original and their equivalent metamorphic variants (prediction flips). We evaluate Morph on two tasks-code clone and vulnerability detection-targeting CodeBERT and GraphCodeBERT for distillation. Our comparison includes Morph, the state-of-theart distillation method AVATAR, and the fine-tuned non-distilled LLMs. Compared to Avatar, Morph produces compressed models that are (i) 47% more robust, (ii) 25% more efficient (fewer floating-point operations), while maintaining (iii) equal or higher accuracy (up to +6%), and (iv) similar model size. ...
Conference paper (2025) - A.J. Bartlett, C.C.S. Liem, A. Panichella
Resolving Python dependency issues remains a tedious and error-prone process, forcing developers to manually trial compatible module versions and interpreter configurations. Existing automated solutions, such as knowledge-graph-based and database-driven methods, face limitations due to the variety of dependency error types, large sets of possible module versions, and conflicts among transitive dependencies. This paper investigates the use of Large Language Models (LLMs) to automatically repair dependency issues in Python programs. We propose pllm (pronounced 'plum'), a novel retrieval-augmented generation (RAG) approach that iteratively infers missing or incorrect dependencies. PLLM builds a test environment where the LLM proposes module combinations, observes execution feedback, and refines its predictions using natural language processing (NLP) to parse error messages. We evaluate PLLM on the Gistable HG2. 9K dataset, a curated collection of real-world Python programs. Using this benchmark, we explore multiple PLLM configurations, including six open-source LLMs evaluated both with and without RAG. Our findings show that RAG consistently improves fix rates, with the best performance achieved by Gemma-2 9B when combined with RAG. Compared to two state-of-the-art baselines, PyEGo and ReadPyE, PLLM achieves significantly higher fix rates; +15.97% more than ReadPyE and +21.58% more than PyEGo. Further analysis shows that PLLM is especially effective for projects with numerous dependencies and those using specialized numerical or machine-learning libraries. ...
Byzantine fault tolerant algorithms are critical for achieving consistency and reliability in distributed systems, especially in the presence of faults or adversarial behavior. The consensus algorithm used by the XRP Ledger falls into this category. In practice, the implementation of these algorithms is prone to errors, which can lead to undesired behavior in the system. This paper introduces Rocket, a fuzz-testing framework designed for the XRPL consensus algorithm. Rocket enables researchers and developers to automatically inject network and process faults into a locally simulated network of XRPL validator nodes to test if the system behaves as expected. This technique has previously been shown to be effective in finding implementation errors. Rocket has been designed to focus on extensibility and ease of use, enabling users to run complex test scenarios with minimal setup. Video: https://www.youtube.com/watch?v=07Z3ufRa51Y ...
Conference paper (2025) - C.S. Cao, A. Panichella, S.E. Verwer
The rising popularity of the microservice architectural style has led to a growing demand for automated testing approaches tailored to these systems. EvoMaster is a state-of-the-art tool that uses Evolutionary Algorithms (EAs) to automatically generate test cases for microservices’ REST APIs. One limitation of these EAs is the use of unit-level search heuristics, such as branch distances, which focus on fine-grained code coverage and may not effectively capture the complex, interconnected behaviors characteristic of system-level testing. To address this limitation, we propose a new search heuristic (MISH) that uses real-time automaton learning to guide the test case generation process. We capture the sequential call patterns exhibited by a test case by learning an automaton from the stream of log events outputted by different microservices within the same system. Therefore, MISH learns a representation of the systemwide behavior, allowing us to define the fitness of a test case based on the path it traverses within the inferred automaton. We empirically evaluate MISH’s effectiveness on six real-world benchmark microservice applications and compare it against a state-of-the-art technique, MOSA, for testing REST APIs. Our evaluation shows promising results for using MISH to guide the automated test case generation within EvoMaster. ...
Conference paper (2025) - G. Dalia, A. Panichella, Andrea Di Sorbo, Gerardo Canfora, Corrado A. Visaggio
Although the study of software transparency has deep roots in software engineering, a shared definition and practical application in real-world development contexts remain elusive. Through an in-depth analysis of the academic and industrial landscape, this article provides an overview of the current state of knowledge on software transparency, outlining a path to a deeper understanding of the subject for both developers and researchers. The challenge of software transparency involves not only establishing a formal, widely accepted understanding within the community, but also measuring and quantifying it in production environments. To this end, we survey academics and developers to evaluate an innovative approach to defining transparency and present a vision of a new framework for its quantification. ...
Conference paper (2025) - A. Bartlett, C. Liem, A. Panichella
DRVN is a regression testing tool that aims to diversify the test scenarios (road maps) to execute for testing and validating self-driving cars. DRVN harnesses the power of convolutional neural networks to identify possible failing roads in a set of generated examples before applying a greedy algorithm that selects and prioritizes the most diverse roads during regression testing. Initial testing discovered that DRVN performed well against random-based test selection. ...
Conference paper (2025) - A.M. Abdullin, Pouria Derakhshanfar, Annibale Panichella
Generating tests automatically is a key and ongoing area of focus in software engineering research. The emergence of Large Language Models (LLMs) has opened up new op-portunities, given their ability to perform a wide spectrum of tasks. However, the effectiveness of LLM -based approaches compared to traditional techniques such as search-based software testing (SBST) and symbolic execution remains uncertain. In this paper, we perform an extensive study of automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. We evaluate tools' performance on the GitBug Java dataset and compare them using various execution-based and feature-based metrics. Our results show that while LLM-based test generation is promising, it falls behind traditional methods w.r.t. coverage. However, it significantly outperforms them in mutation scores, suggesting that LLMs provide a deeper semantic understanding of code. LLM-based approach performed worse than SBST and symbolic execution-based approaches w.r.t. fault detection capabilities. Additionally, our feature-based analysis shows that all tools are affected by the complexity and internal dependencies of the class under test (CUT), with LLM-based approaches being especially sensitive to the CUT size. ...

Filtering Spectra with Compiler Information

Conference paper (2025) - L.H. Applis, M. P. Gissurarson, A. Panichella
Spectrum-based fault localization and its formulas often struggle with large spectra containing many expressions irrelevant to the fault, which impacts its overall effectiveness. Spectra can inflate for large programs or on finer granularity, such as expression-level coverage from other languages like Haskell. To address this, we introduce 25 rules to filter the spectra based on type information, AST attributes, and test results. These aim to reduce the suspiciousness of innocent locations (bug-free expressions) and improve the performance of SBFL formulas w.r.t. TOP50 and TOP100 metrics. Our experiment, conducted on 11 Haskell programs, shows that individual filters significantly reduce spectra size, although some data points (faulty expressions) become unsolvable. By applying established SBFL formulas like Ochiai and Tarantula to these reduced spectra, we observe average improvements of up to 40% w.r.t. TOP50 for individual soft rules, such as proximity to failure. Combining the best-performing filters yields improvements of 45.5% for Ochiai, 67.4% for DStar2, and 45.5% for Tarantula. The most effective filtering rules over all formulas captured proximity to failing expressions, usage of a non-unique type, and whether a failing test covered the expression. Our results suggest that simple, straightforward filters can produce substantial performance gains. We further identify 4 uncovered bugs originating from code generation (common in functional programming) and system tests, which can not be addressed purely by spectrumbased fault localization. ...
Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep-code models as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models’ robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field. ...
Conference paper (2024) - C.S. Cao, Simon Schneider, Nicolás E. Díaz Ferreyra, S.E. Verwer, A. Panichella, Riccardo Scandariato
The microservice architecture allows developers to divide the core functionality of their software system into multiple smaller services. However, this architectural style also makes it harder for them to debug and assess whether the system's deployment conforms to its implementation. We present CATMA, an automated tool that detects non-conformances between the system's deployment and implementation. It automatically visualizes and generates potential interpretations for the detected discrepancies. Our evaluation of CATMA shows promising results in terms of performance and providing useful insights. CATMA is available at https://cyber-analytics.nl/catma.github.io/, and a demonstration video is available at https://youtu.be/WKP1hG-TDKc. ...
Book chapter (2024) - Annibale Panichella, Mitchell Olsthoorn
Many-objective evolutionary algorithms (MOEAs) have been applied in the software testing literature to automate the generation of test cases. While previous studies confirmed the superiority of MOEAs over other algorithms, one of the open challenges is maintaining a strong selective pressure considering the large number of objectives to optimize (coverage targets). This paper investigates four density estimators as a substitute for the traditional crowding distance. In particular, we consider two estimators previously proposed in the evolutionary computation community, namely the subvector-dominance assignment (SD) and the epsilon-dominance assignment (ED). We further propose two novel density estimators specific to test case generation, namely the token-based density estimator (TDE) and the path-based density estimator (PDE). Based on the CodeBERT model tokenizer, TDE uses natural language processing to measure the semantic distance between test cases. PDE, on the other hand, considers the distance between the source-code paths executed by the test cases. We evaluate these density estimators within EvoSuite on 100 non-trivial Java classes from the SF110 benchmark. Our results show that the proposed path-based density estimator (PDE) outperforms all other density estimators in enhancing mutation scores. It increases mutation scores by 4.26 % on average (with a max of over 60%) to the traditional crowding distance. ...
Adversarial examples remain a critical concern for the robustness of deep learning models, showcasing vulnerabilities to subtle input manipulations. While earlier research focused on generating such examples using white-box strategies, later research focused on gradient-based black-box strategies, as models' internals often are not accessible to external attackers. This paper extends our prior work by exploring a gradient-free search-based algorithm for adversarial example generation, with particular emphasis on differential evolution (DE). Building on top of the classic DE operators, we propose five variants of gradient-free algorithms: a single-objective approach (GADE), two multi-objective variations (NSGA-IIDE and MOEA/DDE), and two many-objective strategies (NSGA-IIIDE and AGE-MOEADE). Our study on five canonical image classification models shows that whilst GADE variant remains the fastest approach, NSGA-IIDE consistently produces more minimal adversarial attacks (i.e., with fewer image perturbations). Moreover, we found that applying a post-process minimization to our adversarial images, would further reduce the number of changes and overall delta variation (image noise). ...
Over the last decades, various tools (e.g., AUSTIN and EvoSuite) have been developed to automate the process of unit-level test case generation. Most of these tools are designed for statically-typed languages, such as C and Java. However, as is shown in recent Stack Overflow developer surveys, the popularity of dynamicallytyped languages, such as JavaScript and Python, has been increasing and is dominating the charts. Only recently, tools for automated test case generation of dynamically-typed languages have started to emerge (e.g., Pynguin for Python). However, to the best of our knowledge, there is no tool that focuses on automated test case generation for server-side JavaScript. To this aim, we introduce SynTest-JavaScript, a user-friendly tool for automated unit-level test case generation for (server-side) JavaScript. To showcase the effectiveness of SynTest-JavaScript, we empirically evaluate it on five large open-source JavaScript projects ...
Conference paper (2024) - Călin Georgescu, Mitchell Olsthoorn, Pouria Derakhshanfar, Marat Akhin, Annibale Panichella
Compiler correctness is a cornerstone of reliable software development. However, systematic testing of compilers is infeasible, given the vast space of possible programs and the complexity of modern programming languages. In this context, differential testing offers a practical methodology as it addresses the oracle problem by comparing the output of alternative compilers given the same set of programs as input. In this paper, we investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains. We propose a black-box generative approach that creates input programs for the K1 and K2 compilers. First, we build workable models of Kotlin semantic (semantic interface) and syntactic (enriched context-free grammar) language features, which are subsequently exploited to generate random code snippets. Second, we extend random sampling by introducing two genetic algorithms (GAs) that aim to generate more diverse input programs. Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers. While we do not observe a significant difference w.r.t. the number of defects uncovered by the different search algorithms, random search and GAs are complementary as they find different categories of bugs. Finally, we provide insights into the relationships between the size, complexity, and fault detection capability of the generated input programs. ...
Conference paper (2024) - J. Sallou, T. Durieux, A. Panichella
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community, impacting various SE tasks from code completion to test generation, from program repair to code summarization. Despite their promise, researchers must still be careful as numerous intricate factors can influence the outcomes of experiments involving LLMs.
This paper initiates an open discussion on potential threats to the validity of LLM-based research including issues such as closed-source models, possible data leakage between LLM training data and research evaluation, and the reproducibility of LLM-based findings.
In response, this paper proposes a set of guidelines tailored for SE researchers and Language Model (LM) providers to mitigate these concerns.
The implications of the guidelines are illustrated using existing good practices followed by LLM providers and a practical example for SE researchers in the context of test case generation. ...
Conference paper (2024) - Arkadii Sapozhnikov, Mitchell Olsthoorn, A. Panichella, V.V. Kovalenko, P. Derakhshanfar
Writing software tests is laborious and time-consuming. To address this, prior studies introduced various automated test-generation techniques. A well-explored research direction in this field is unit test generation, wherein artificial intelligence (AI) techniques create tests for a method/class under test. While many of these techniques have primarily found applications in a research context, existing tools (e.g., EvoSuite, Randoop, and AthenaTest) are not user-friendly and are tailored to a single technique. This paper introduces TestSpark, a plugin for IntelliJ IDEA that enables users to generate unit tests with only a few clicks directly within their Integrated Development Environment (IDE). Furthermore, TestSpark also allows users to easily modify and run each generated test and integrate them into the project workflow. TestSpark leverages the advances of search-based test generation tools, and it introduces a technique to generate unit tests using Large Language Models (LLMs) by creating a feedback cycle between the IDE and the LLM. Since TestSpark is an open-source (https://github.com/JetBrains-Research/TestSpark), extendable, and well-documented tool, it is possible to add new test generation methods into the plugin with the minimum effort. This paper also explains our future studies related to TestSpark and our preliminary results. Demo video: https://youtu.be/0F4PrxWfiXo ...
Journal article (2024) - Imara van Dinten, Pouria Derakhshanfar, Annibale Panichella, Andy Zaidman
Cyber-Physical Systems (CPSs) have gained traction in recent years. A major non-functional quality of CPS is performance since it affects both usability and security. This critical quality attribute depends on the specialized hardware, simulation engines, and environmental factors that characterize the system under analysis. While a large body of research exists on performance issues in general, studies focusing on performance-related issues for CPSs are scarce. The goal of this paper is to build a taxonomy of performance issues in CPSs. To this aim, we present two empirical studies aimed at categorizing common performance issues (Study I) and helping developers detect them (Study II). In the first study, we examined commit messages and code changes in the history of 14 GitHub-hosted open-source CPS projects to identify commits that report and fix self-admitted performance issues. We manually analyzed 2699 commits, labeled them, and grouped the reported performance issues into antipatterns. We detected instances of three previously reported Software Performance Antipatterns (SPAs) for CPSs. Importantly, we also identified new SPAs for CPSs not described earlier in the literature. Furthermore, most performance issues identified in this study fall into two new antipattern categories: Hard Coded Fine Tuning (399 of 646) and Magical Waiting Number (150 of 646). In the second study, we introduce static analysis techniques for automatically detecting these two new antipatterns; we implemented them in a tool called AP-Spotter. We analyzed 9 open-source CPS projects not utilized to build the SPAs taxonomy to benchmark AP-Spotter. Our results show that AP-Spotter achieves 62.04% precision in detecting the antipatterns ...
Conference paper (2023) - L.S. Veldkamp, Mitchell Olsthoorn, A. Panichella
Web Application Programming Interfaces (APIs) allow systems to be addressed programmatically and form the backbone of the internet. RESTful and RPC APIs are among the most common API architectures used. In the last decades, researchers have proposed various techniques for automated testing of RESTful APIs, however, to the best of the authors' knowledge there exists no work on testing JSON-RPC (one of the two data formats supported by RPC) APIs. To address this limitation, we propose a grammar-based evolutionary fuzzing approach for testing JSON-RPC APIs that uses a novel black-box heuristic. Specifically, we use a diversity-based fitness function based on hierarchical clustering to quantity the differences in API method responses. Our hypothesis is that responses that are unlike previously seen ones are an indication that new uncovered code paths are reached. We evaluate our approach on the XRP ledger, a large-scale industrial blockchain system that uses JSON-RPC APIs. Our results show that the proposed approach performs significantly better than the baseline (grammar-based fuzzer) and covers an additional 240 branches. ...
Deep learning (DL) models are known to be highly accurate, yet vulnerable to adversarial examples. While earlier research focused on generating adversarial examples using whitebox strategies, later research focused on black-box strategies, as models often are not accessible to external attackers. Prior studies showed that black-box approaches based on approximate gradient descent algorithms combined with meta-heuristic search (i.e., the BMI-FGSM algorithm) outperform previously proposed white- and black-box strategies. In this paper, we propose a novel black-box approach purely based on differential evolution (DE), i.e., without using any gradient approximation method. In particular, we propose two variants of a customized DE with customized variation operators: (1) a single-objective (Pixel-SOO) variant generating attacks that fool DL models, and (2) a multi-objective variant (Pixel-MOO) that also minimizes the number of changes in generated attacks. Our preliminary study on five canonical image classification models shows that Pixel-SOO and Pixel-MOO are more effective than the state-of-the-art BMI-FGSM in generating adversarial attacks. Furthermore, Pixel-SOO is faster than Pixel-MOO, while the latter produces subtler attacks than its single-objective variant. ...