L.H. Applis

8 records found

Filtering Spectra with Compiler Information

Conference paper (2025) - L.H. Applis, M. P. Gissurarson, A. Panichella
Spectrum-based fault localization (SBFL) and its formulas often struggle with large spectra containing many expressions irrelevant to the fault, which reduces overall effectiveness. Spectra can inflate for large programs or at finer granularity, such as the expression-level coverage available in languages like Haskell. To address this, we introduce 25 rules to filter the spectra based on type information, AST attributes, and test results. These aim to reduce the suspiciousness of innocent locations (bug-free expressions) and improve the performance of SBFL formulas w.r.t. TOP50 and TOP100 metrics. Our experiment, conducted on 11 Haskell programs, shows that individual filters significantly reduce spectra size, although some data points (faulty expressions) become unsolvable. By applying established SBFL formulas like Ochiai and Tarantula to these reduced spectra, we observe average improvements of up to 40% w.r.t. TOP50 for individual soft rules, such as proximity to failure. Combining the best-performing filters yields improvements of 45.5% for Ochiai, 67.4% for DStar2, and 45.5% for Tarantula. The most effective filtering rules across all formulas captured proximity to failing expressions, usage of a non-unique type, and whether a failing test covered the expression. Our results suggest that simple, straightforward filters can produce substantial performance gains. We further identify 4 uncovered bugs originating from code generation (common in functional programming) and system tests, which cannot be addressed purely by spectrum-based fault localization. ...
Doctoral thesis (2024) - L.H. Applis, A. van Deursen, A. Panichella
Finding and fixing software faults is a major part of software development; any improvement to these tasks is therefore a welcome aid for developers and a worthwhile field for researchers. Like programming in general, debugging and repair need specialized tools to provide the necessary information (like the usage of runtime resources) or assure quality (e.g. with test suites). Only then are developers able to repair faults without introducing new ones. There are also more sophisticated tools that provide stronger, more automated help to developers: program coverage summarizes run-time behavior, fault localization helps to narrow down suspicious parts of the code, and automated program repair suggests possible patches that lead to a passing test suite. On top of these approaches, large language models show promising capabilities to generate, alter, and test source code, but they have yet to be tested and hardened for their security and quality.
To enable the next generation of state-of-the-art quality-assurance tools, this thesis investigates different techniques and their respective tools to improve their precision and correctness. To this end, we develop procedures to quantify the robustness of large language models of code and to identify their weaknesses when facing metamorphic noise or statistically unlikely data. After examining the quality of tools, this dissertation works towards improving existing tools and approaches in the field of functional programming, particularly for Haskell. Functional programming is a field rich in unique options such as properties, strong type systems, and side-effect-free functions, but also in challenges like non-strict evaluation.
Our results regarding large language models show that there are shortcomings when dealing with redundant elements and that such elements can be intentionally searched for. This implies a need for further improvement of the models to provide more consistent results for trivial changes.
The work centered around Haskell shows the value of utilizing compiler and language features to enhance existing techniques: program repair can be performed with a reduced search space due to compiler-suggested elements, stack traces and program coverage can be enhanced by introducing an evaluation trace, and fault localization is aided by types and expression-level granularity. While the implementation is specific, the approaches remain transferable: any feature of Haskell used in this dissertation is (or can be) implemented for Java.
In summary, this thesis touches on different topics of assuring software quality and their tools by introducing novel information. This thesis lays groundwork to improve the next generation of development-tools that utilize large language models or statically typed languages. ...
Conference paper (2023) - L.H. Applis, A. Panichella, R.J. Marang
More machine learning (ML) models are being introduced to the field of Software Engineering (SE) and have reached a stage of maturity at which they are considered for real-world use. But the real world is complex, and testing these models often falls short in explainability, feasibility, and computational capacity. Existing research introduced metamorphic testing to gain additional insights and certainty about the model by applying semantic-preserving changes to input data while observing model output. As this is currently done at random places, it can lead to potentially unrealistic datapoints and high computational costs. With this work, we introduce genetic search as an aid for metamorphic testing in SE ML. Exploiting the delta in output as a fitness function, the evolutionary intelligence optimizes the transformations to produce higher deltas with fewer changes. We perform a case study minimizing F1 and MRR for Code2Vec on a representative sample from java-small with both genetic and random search. Our results show that within the same amount of time, genetic search achieved a 10% decrease in F1 while random search produced a 3% drop. ...

Haskell - Tracing Lazy Evaluations in a Functional Language

Conference paper (2023) - Matthías Páll Gissurarson, Leonhard Herbert Applis
In non-strict languages such as Haskell the execution of individual expressions in a program significantly deviates from the order in which they appear in the source code. This can make it difficult to find bugs related to this deviation, since the evaluation of expressions does not occur in the same order as in the source code. At the moment, Haskell errors focus on values being produced, whereas it is often the case that faults are due to values being consumed. For non-strict languages, values involved in a bug are often generated immediately prior to the evaluation of the buggy code. This creates an opportunity for evaluation traces, tracking recently evaluated locations (which can deviate from call-order) to help establish the origin of values involved in faults. In this paper, we describe an extension of GHC's Haskell Program Coverage with evaluation traces, recording recent evaluations in the coverage file, and reporting an evaluation trace alongside the call stack on exception. This lets us reconstruct the chain of events and locate the origin of faults. As a case study, we applied our initial implementation to the nofib-buggy data set and found that some runtime errors greatly benefit from trace information. ...
Conference paper (2023) - Leonhard Applis, Annibale Panichella
We present HasBugs, an extensible and manually curated dataset of 25 real-world Haskell bugs from 6 open-source repositories. We provide a faulty, tested, and fixed version of each bug in our dataset with reproduction packages, description, and bug context. For technical users, the dataset is meant either to help researchers adapt techniques from other programming languages to Haskell or to provide a human-verified gold standard for tool evaluation and enable future reproducibility. We also see applicability for qualitative research, e.g., by analysis of bug lifecycles and comparison to other languages. We provide a companion website for easy access and overview at https://ciselab.github.io/HasBugs/. ...
Conference paper (2022) - Matthías Páll Gissurarson, L.H. Applis, A. Panichella, A. van Deursen, David Sands
Automatic program repair (APR) regularly faces the challenge of overfitting patches: patches that pass the test suite but do not actually address the problem when evaluated manually. Currently, overfit detection requires manual inspection or an oracle, making quality control of APR an expensive task. With this work, we introduce properties in addition to unit tests for APR to address the problem of overfitting. To that end, we design and implement PropR, a program repair tool for Haskell that leverages both property-based testing (via QuickCheck) and the rich type system and synthesis offered by the Haskell compiler. We compare the repair ratio, time to first patch, and overfitting ratio when using unit tests, property-based tests, and their combination. Our results show that properties lead to quicker results and have a lower overfit ratio than unit tests. The created overfit patches provide valuable insight into the underlying problems of the program to repair (e.g., in terms of fault localization or test quality). We consider this step towards fitter, or at least insightful, patches a critical contribution to bringing APR into developer workflows. ...

Refocussing SE ML on the Homo Sapience

Abstract (2022) - L.H. Applis
Many tasks in machine learning for software engineering rely on prominent NLP metrics, such as the BLEU or ROUGE score. These metrics are under heavy criticism themselves within the NLP community, but the SE community adopted them for lack of better alternatives. Within this paper, we summarize some of the problems with common metrics, using code as an example, and look for alternatives. We argue that our only hope is the worst of all possible options: Humans. ...
Conference paper (2021) - L.H. Applis, A. Panichella, A. van Deursen
Metamorphic testing is a well-established testing technique that has been successfully applied in various domains, including testing deep learning models to assess their robustness against data noise or malicious input. Currently, metamorphic testing approaches for machine learning (ML) models have focused on image processing and object recognition tasks. Hence, these approaches cannot be applied to ML targeting program analysis tasks. In this paper, we extend metamorphic testing approaches to ML models targeting software programs. We present Lampion, a novel testing framework that applies (semantics-preserving) metamorphic transformations to the test datasets. Lampion produces new code snippets equivalent to the original test set but different in their identifiers or syntactic structure. We evaluate Lampion against CodeBERT, a state-of-the-art ML model for Code-To-Text tasks that creates Javadoc summaries for given Java methods. Our results show that simple transformations significantly impact the target model's behavior, providing additional information on the model's reasoning apart from the classic performance metric. ...