L.H. Applis
Please Note
8 records found
1
Suspicious Types and Bad Neighborhoods
Filtering Spectra with Compiler Information
To enable the next generation of state of the art quality-assurance tools, this thesis investigates different techniques and their respective tools to improve their precision and correctness. To this end, we develop procedures to quantify the robustness of large language models of code to identify their weaknesses when facing metamorphic noise or statistically unlikely data. After examining quality of tools, this dissertation works towards improving existing tools and approaches in the field of functional programming, particularly for Haskell. Functional programming is a field rich of unique options such as properties, strong type-systems, side-effect free functions, but also challenges like non-strict evaluation.
Our results regarding large language models show that there are short-comings when dealing with redundant elements and that such elements can be intentionally searched for. This implies a need for further improvement of the models, to provide more consistent results for trivial changes.
The work centered around Haskell shows the value of utilizing compiler- and language-features to enhance existing techniques: Program repair can be performed with a reduced search space due to compiler-suggested elements, stack-traces and program-coverage can be enhanced by introducing an evaluation-trace and fault-localization is aided by types and expression-level granularity. While the implementation is specific, the approaches remain transferable: Any feature that is used from Haskell in this dissertation, is (or can be) implemented for Java.
In summary, this thesis touches on different topics of assuring software quality and their tools by introducing novel information. This thesis lays groundwork to improve the next generation of development-tools that utilize large language models or statically typed languages. ...
To enable the next generation of state of the art quality-assurance tools, this thesis investigates different techniques and their respective tools to improve their precision and correctness. To this end, we develop procedures to quantify the robustness of large language models of code to identify their weaknesses when facing metamorphic noise or statistically unlikely data. After examining quality of tools, this dissertation works towards improving existing tools and approaches in the field of functional programming, particularly for Haskell. Functional programming is a field rich of unique options such as properties, strong type-systems, side-effect free functions, but also challenges like non-strict evaluation.
Our results regarding large language models show that there are short-comings when dealing with redundant elements and that such elements can be intentionally searched for. This implies a need for further improvement of the models, to provide more consistent results for trivial changes.
The work centered around Haskell shows the value of utilizing compiler- and language-features to enhance existing techniques: Program repair can be performed with a reduced search space due to compiler-suggested elements, stack-traces and program-coverage can be enhanced by introducing an evaluation-trace and fault-localization is aided by types and expression-level granularity. While the implementation is specific, the approaches remain transferable: Any feature that is used from Haskell in this dissertation, is (or can be) implemented for Java.
In summary, this thesis touches on different topics of assuring software quality and their tools by introducing novel information. This thesis lays groundwork to improve the next generation of development-tools that utilize large language models or statically typed languages.
More machine learning (ML) models are introduced to the field of Software Engineering (SE) and reached a stage of maturity to be considered for real-world use; But the real world is complex, and testing these models lacks often in explainability, feasibility and computational capacities. Existing research introduced meta-morphic testing to gain additional insights and certainty about the model, by applying semantic-preserving changes to input-data while observing model-output. As this is currently done at random places, it can lead to potentially unrealistic datapoints and high computational costs. With this work, we introduce genetic search as an aid for metamorphic testing in SE ML. Exploiting the delta in output as a fitness function, the evolutionary intelligence optimizes the transformations to produce higher deltas with less changes. We perform a case study minimizing F1 and MRR for Code2Vec on a representative sample from java-small with both genetic and random search. Our results show that within the same amount of time, genetic search was able to achieve a decrease of 10% in F1 while random search produced 3% drop.
CSI
Haskell - Tracing Lazy Evaluations in a Functional Language
In non-strict languages such as Haskell the execution of individual expressions in a program significantly deviates from the order in which they appear in the source code. This can make it difficult to find bugs related to this deviation, since the evaluation of expressions does not occur in the same order as in the source code. At the moment, Haskell errors focus on values being produced, whereas it is often the case that faults are due to values being consumed. For non-strict languages, values involved in a bug are often generated immediately prior to the evaluation of the buggy code. This creates an opportunity for evaluation traces, tracking recently evaluated locations (which can deviate from call-order) to help establish the origin of values involved in faults. In this paper, we describe an extension of GHC's Haskell Program Coverage with evaluation traces, recording recent evaluations in the coverage file, and reporting an evaluation trace alongside the call stack on exception. This lets us reconstruct the chain of events and locate the origin of faults. As a case study, we applied our initial implementation to the nofib-buggy data set and found that some runtime errors greatly benefit from trace information.
BLEU it All Away!
Refocussing SE ML on the Homo Sapience
rely on prominent NLP metrics, such as the BLEU or
ROUGE score. The metrics are under heavy criticism themselves
within the NLP community, but the SE community adapted them
for lack of better alternatives. Within this paper, we summarize
some of the problems with common metrics at the examples of
code and look for alternatives. We argue that our only hope is
the worst of all possible options: Humans. ...
rely on prominent NLP metrics, such as the BLEU or
ROUGE score. The metrics are under heavy criticism themselves
within the NLP community, but the SE community adapted them
for lack of better alternatives. Within this paper, we summarize
some of the problems with common metrics at the examples of
code and look for alternatives. We argue that our only hope is
the worst of all possible options: Humans.