A. van Deursen

106 records found

Refactoring is a critical part of the software development lifecycle, and identifier renaming accounts for roughly 15% of all agentic refactoring work driven by large language models. Yet the dominant model families fit the task poorly. Autoregressive decoders generate left to right, and even with the fill-in-the-middle extension they resolve masked positions one at a time, so a renaming decision at one site cannot inform a decision at another. Identifier renaming, however, demands consistency across every affected site at once. Diffusion Large Language Models (dLLMs) generate by iteratively denoising a masked sequence under full bidirectional attention, with every prediction conditioned on every other. This matches what renaming needs: if a poorly named identifier is viewed as a small amount of semantic noise overlaid on correct code, then renaming becomes a targeted denoising task that can be solved jointly across all affected sites.

We instantiate this view as CoReFusion, the first systematic study of dLLMs on Java identifier renaming, and benchmark them against twelve decoder-only FIM-AR baselines and five encoder-decoder Seq2Seq baselines on the RefineID dataset. DreamCoder-7B and DiffuCoder-7B reach 33.2% and 31.1% Exact Match, beating the best non-dLLM model (CodeT5-large) by more than ten points while being roughly nine times smaller than the largest FIM-AR baseline. The advantage grows with the number of identifiers that must be renamed together: FIM-AR models win the single-site case, but dLLMs pull ahead as soon as the task involves more than one site. When the same dLLMs must instead find the positions on their own, Exact Match drops to about 3%, and most wrong predictions copy the lexical style of the surrounding code rather than improve on it. Probing the internal states of DiffuCoder-7B shows why: the signal that tells a bad name from a good one appears only in the last few layers and the last few denoising steps, after the unmasking schedule has already confirmed most of its predictions. Providing the rename positions as masks bypasses this timing problem, which is why dLLMs work as filling engines but not as standalone refactoring agents. ...
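The comparison above rests on Exact Match over sites that must be renamed together. A minimal sketch of such a metric follows, assuming a sample counts as correct only when every masked site receives the reference name. The function name and data layout are illustrative and are not taken from CoReFusion or RefineID.

```python
from typing import Sequence


def exact_match_rate(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of samples whose predicted names equal the reference at every site.

    Assumption: each sample is a list of identifier names, one per masked site.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1
        for pred, ref in zip(predictions, references)
        if list(pred) == list(ref)
    )
    return hits / len(references)


# Hypothetical data: the second sample is consistent across its two sites
# but still wrong, so it does not count as an exact match.
preds = [["count", "count"], ["tmp", "tmp"], ["index"]]
refs = [["count", "count"], ["total", "total"], ["index"]]
print(exact_match_rate(preds, refs))
```

Scoring whole samples rather than individual sites is one way to reward the cross-site consistency that the abstract argues renaming requires.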
Doctoral thesis (2026) - P. Altmeyer, C.C.S. Liem, A. van Deursen
Many of the most celebrated recent advances in artificial intelligence (AI) have been built on the back of highly complex and opaque models that need little human oversight to achieve strong predictive performance. But while their capacity to recognize patterns from raw data is impressive, their decision-making process is neither robust nor well understood. This has so far inhibited trust and widespread adoption of these technologies. This thesis contributes to research efforts aimed at tackling these challenges, through interdisciplinary insights and methodological contributions.

The principal goal of this work is to contribute methods that help us in making opaque AI models more trustworthy. Specifically, we aim to (1) explore and challenge existing technologies and paradigms in the field; (2) improve our ability to hold opaque models accountable through thorough scrutiny; and (3) leverage the results of such scrutiny during training to improve the trustworthiness of models. Methodologically, the thesis focuses on counterfactual explanations and algorithmic recourse for individuals subjected to opaque AI systems. We explore what type of real-world dynamics can be expected to play out when recourse is provided and implemented in practice. Based on our finding that individual cost minimization, a core objective in recourse, neglects hidden external costs of recourse itself, we revisit yet another established objective: namely, that explanations should be plausible first and foremost. Our work demonstrates that a narrow focus on this objective can mislead us into trusting fundamentally untrustworthy systems. To avoid this scenario, we propose a novel method that aids us in disclosing explanations that are maximally faithful, that is, consistent with the behavior of models. This not only allows us to assess the trustworthiness of models, but also to improve it: we show that faithful explanations can be used during training to ensure that models learn plausible explanations.

Finally, we also critically assess efforts towards trustworthy AI in the context of modern large language models (LLMs). Specifically, we cast doubt on recent findings and practices presented in the field of mechanistic interpretability and caution our fellow researchers in this space against misinterpreting and inflating their findings.

In summary, this thesis makes cutting-edge research contributions that improve our ability to make opaque AI models more trustworthy. Beyond our core research contributions, this thesis makes substantial contributions to open-source software. Through various software packages that we have developed, we make our research and that of others more accessible.
...
The adoption of AI systems across various sectors has increased considerably in recent years. This is a consequence of the remarkable capability of AI to extract insights from large-scale datasets, improve personalization, automate tasks and complex processes within organizations, and support more informed decision-making. Notable examples include the financial sector, where AI is applied to monitor transactions and accelerate credit decision processes; healthcare, where AI contributes to drug discovery and assists clinicians in early diagnosis; manufacturing, where predictive maintenance using AI systems helps reduce costs and mitigate the risks associated with unexpected failures; and software engineering, where AI supports anomaly detection, fault prediction, and resource demand forecasting in large-scale, complex systems.
Despite the widespread adoption and potential of AI systems, most research has been focused on model development, while investigations into their lifecycle and evolution in production environments remain at an early stage. This research path is particularly relevant for AI practitioners, who are responsible for ensuring the reliability, functionality, and predictive accuracy of deployed systems. To bridge the gap between scientific research and the practical needs of industry practitioners, this thesis focuses on two key aspects of the AI lifecycle: techniques for monitoring and maintaining AI systems over time. ...
...
Doctoral thesis (2026) - C.R. van der Rest, A. van Deursen, N. Yorke-Smith, C. Bach
Type systems are a tool for preventing software errors, by classifying (sub)terms according to how they are evaluated. This way, mistakes can be caught at compile-time, ruling out the existence of entire classes of mistakes altogether. Using a programming language with a strong type system to develop critical software can dramatically reduce the prevalence and impact of bugs.
In light of the potentially enormous impact of bugs, it is important that we can trust a type system to be successful in preventing errors. A key property of type systems that reflects this criterion is type soundness, which establishes that “well-typed programs cannot go wrong”. That is, programs that are deemed safe by the type system should not exhibit certain wrong behaviour when executed. To gain trust in a type system’s ability to prevent errors, we can give a formal specification of both a language’s type system and semantics, and mathematically verify that the type system is sound with respect to the defined semantics. While this provides airtight evidence that a type system is successful in ruling out certain mistakes, the formal specification and verification of programming languages requires a formidable amount of time and expertise on behalf of the language designer, and is therefore infeasible in most cases.... ...
Master thesis (2025) - S.R. Sunnevudóttir, A. van Deursen, M.J.G. Olsthoorn, Pouria Derakhshanfar, M.A. Costea
Automated test generation is a critical area of research in software engineering, aiming to reduce manual effort while improving software reliability. While substantial work has focused on statically typed languages, dynamically typed languages such as JavaScript remain underexplored despite their widespread use and unique challenges. This thesis investigates the current status of JavaScript test generation by systematically evaluating state-of-the-art search-based and large language model-based tools.

We first analyze existing benchmarks to assess their coverage of representative language features, identifying gaps that limit the ability to fairly compare tool performance. We then construct a curated dataset of real-world JavaScript projects and evaluate the LLM-based tool TestPilot and the search-based tool SynTest using a combination of quantitative metrics (e.g., code coverage, pass rates) and feature-based correlation analysis. Our results reveal that TestPilot tends to generate higher coverage (median 27.9% vs 11.2% branch coverage) and more readable tests but produces a larger number of failing or low-value test cases, while SynTest generates more stable and focused test suites yet can struggle with complex or dynamic code constructs. Our similarity analysis shows that each approach achieved unique coverage, suggesting complementary strengths.

This study highlights the need for standardized, language-aware benchmarks and introduces a curated dataset and evaluation framework for evaluating JavaScript test generation tools. By systematically comparing search-based and LLM-based approaches, this thesis offers insights into their respective strengths, limitations, and opportunities for hybrid strategies, advancing the state of automated testing for dynamically typed languages. ...
As Large Language Models become an ever more integral part of Software Engineering, often assisting developers on coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to have an impact on a Large Language Model’s ability on such tasks [3]. Boilerplate code is considered to be a subcategory of said smells. In this paper, we investigate a specific type of this smell, boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance. ...
This paper investigates the relation between the educational value of input code and the subsequent inference performance of code large language models (LLMs) on completion tasks. Results were obtained using The Heap dataset and the SmolLM2, StarCoder 2 and Mellum models. Performance was measured by comparing the generated outputs with the ground truth, where high similarity indicates high performance. We analyse how factors such as language, model size, task type and granularity of educational value affect performance across levels of educational value. We find that most factors do not have a relation with educational value, as most metrics plateau. The exception is exact match, which shows a consistent negative correlation with educational value. Additionally, a consistent turning point is seen around an educational value of 1.75, before which performance tends to have a more positive relation with educational value. Results highlight the influence of input quality on LLM behaviour and offer insights for more effective training and evaluation strategies. ...
Large Language Models (LLMs) are increasingly integrated into development workflows for tasks such as code completion, bug fixing, and refactoring. While prior work has shown that removing low-quality data—including data smells like Self-Admitted Technical Debt (SATD)—from training data can improve model performance, the isolated effect of SATD at inference time remains unclear.

This study investigates the impact of SATD on LLM performance during code completion. Using The Heap dataset, we annotate over 5 million Java files with SATD bitmasks and construct a set of input–target pairs based on varying SATD contexts and masking strategies. Three code generation models, SmolLM2, StarCoder2, and Mellum, are evaluated on both comment and method generation tasks using standard text-based metrics and manual semantic classification.

Our results show that the presence of SATD in input has a negligible effect on generation quality. Instead, performance is primarily driven by target method length, structural complexity, and context size. We also find that metrics may misrepresent semantic correctness in the presence of non-functional elements such as comments. These findings suggest that careful control of target complexity is more critical than the presence of SATD alone when evaluating LLM performance on code. ...
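The annotation step described above, marking files with SATD bitmasks, can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword list, the one-bit-per-line layout, and the naive comment matcher are all hypothetical, since the abstract does not describe the detector actually used.

```python
import re

# Assumed marker list; the detector used in the study is not described in the abstract.
SATD_MARKERS = re.compile(r"\b(TODO|FIXME|HACK|XXX)\b", re.IGNORECASE)
# Naive Java comment matcher for line comments and block comments.
# It does not handle comment delimiters that appear inside string literals.
JAVA_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def satd_bitmask(java_source: str) -> list[int]:
    """Return one bit per source line: 1 if a comment on that line holds an SATD marker."""
    lines = java_source.split("\n")
    mask = [0] * len(lines)
    for comment in JAVA_COMMENT.finditer(java_source):
        for marker in SATD_MARKERS.finditer(comment.group()):
            # Convert the marker's absolute offset into a zero-based line index.
            offset = comment.start() + marker.start()
            mask[java_source.count("\n", 0, offset)] = 1
    return mask


# Hypothetical Java snippet with one line comment and one block comment.
sample = "\n".join(
    [
        "int x = 0; // TODO: remove magic number",
        "int y = 1;",
        "/* quick",
        "   hack until parser is fixed */",
    ]
)
print(satd_bitmask(sample))
```

A per-line mask like this makes it straightforward to build input and target pairs that include or exclude SATD lines, which is the kind of context variation the study evaluates.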
Bachelor thesis (2025) - J.K. Wierzbicki, T.J. Coopmans, A. van Deursen
The Byzantine agreement problem in computer science focuses on honest parties trying to achieve consensus in a network with malicious actors. The performance of a quantum-aided Byzantine agreement protocol was evaluated under more realistic noise conditions, with a particular focus on gate-level errors. Since quantum systems are affected by various forms of noise, understanding the impact of quantum noise is crucial for assessing the practical viability and robustness of such protocols. Our results indicate a gate error probability threshold of 0.001%, below which the protocol maintains a failure probability of less than 5%. However, only a single source of noise was considered, with the depolarizing probabilities for single- and two-qubit gates assumed to be equal. This noise level closely matches the currently achievable error rate for single-qubit gates, but is over an order of magnitude lower than that for two-qubit gates on quantum network hardware. Consequently, our findings suggest that the protocol, in its current form, requires further research before it can be deployed on existing quantum devices. Moreover, the results strongly indicate that two-qubit gate errors are the primary bottleneck. These results highlight the significant impact of quantum noise on distributed quantum protocols and underscore the need for either improved quantum hardware or enhanced fault-tolerant protocol designs. ...
Master thesis (2025) - A. Dumitriu, A. van Deursen, S.S. Chakraborty
Developing correct concurrent data structures under weak memory models presents significant challenges due to subtle concurrency errors arising from relaxed ordering guarantees and complexities in Safe Memory Reclamation. Existing synthesis methods largely assume sequential consistency, overlooking critical reorderings allowed by realistic architectures.

This thesis introduces a synthesis-verification pipeline that iteratively generates concurrent data structures from partial code specifications using Large Language Models. The pipeline is expanded by integrating an advanced model checker, GenMC, enhanced specifically to verify SMR correctness under weak memory through automaton-based hazard pointer verification. This integration provides memory safety guarantees across diverse execution scenarios.

We evaluate our approach using established concurrent data structure benchmarks, demonstrating rapid convergence to correct implementations, outperforming state-of-the-art methods. These results highlight the pipeline’s effectiveness and scalability, illustrating its potential to support researchers in developing novel, reliable concurrent data structures under weak memory models. ...
Delayed software projects are one of the biggest threats to the integrity of many project portfolios. If portfolio managers were able to foresee delays, they could better manage risks, make adjustments to the planning and reduce delay propagation. In their 2023 paper "Dynamic Prediction of Delays in Software Projects using Delay Patterns and Bayesian Modeling", Kula et al. propose an AI solution for the problem of ineffective delay prediction of software projects. Even though Kula et al. achieved positive results, they are bound to ING’s data, and thus may not be representative of software projects in other companies or industries. This thesis builds on Kula et al.’s work by applying the same methodology to a new dataset - Coca-Cola Hellenic's Project Portfolio. By doing so, it assesses the robustness and generalisability of Kula et al.'s delay prediction model. The results clearly indicate that the model was unsuccessful at Coca-Cola Hellenic, as it proved no better than random guessing.
Differences in dataset size and quality were identified as the primary cause for the lack of performance. Furthermore, contextual factors were likely a major contributor to the difference in results, namely differences in industry, organisational structure and agile maturity. These findings are valuable to anyone attempting to replicate this solution, or to organisations aiming to adopt AI-powered analytics.
Future research directions are suggested, such as a requirement framework for AI solutions and further replication of Kula et al.'s work in different contexts. ...
Master thesis (2025) - T.J. Nulle, A. van Deursen, L. Cruz, J. Yang
This thesis investigates reducing carbon emissions in code generation using large language models (LLMs) by comparing function-level and line-level code completions across models of different sizes (1.5B and 9B parameters). The study utilises the BigCodeBench dataset, comprising 1,140 Python programming problems, to evaluate the energy consumption, test accuracy, and time efficiency of code completions. The models, 4-bit quantised and run on a CPU, performed 30 function-level completions and 30 line-level completions for each line, which were tested for correctness. Results indicate that, while line-level completions require slightly more energy per token, they are more efficient overall in terms of total energy consumption and token usage. The smaller model with line-level completions showed significant reductions in carbon emissions, achieving an average tenfold reduction compared to the large model with function-level completions. With the large model, line-level completions achieved a 4.5× reduction in carbon emissions compared to function-level completions. Line-level completions were more token-efficient, wasting less than 1% of energy, compared to 20% for function-level completions. From a sustainability perspective, line-level completions offer a practical strategy to reduce the environmental impact of code generation tasks while maintaining strong performance. The study suggests that optimising completion strategies could help balance energy consumption, test accuracy, and time efficiency. Future research could explore a broader range of model sizes, fine-tuning models specifically for line-level completions, the decrease in performance with solution length, and alternative validation metrics to assess code generation performance. ...
Master thesis (2025) - R. Popescu, A. van Deursen, M. Izadi, J. Yang
The rapid rise in the popularity of large language models has highlighted the need for extensive datasets, especially for training on code. However, this growth has also raised important questions about the legal implications of using code in large language model training, particularly regarding the potential infringement of code licenses. At the same time, the availability of clean datasets for evaluating these models is becoming increasingly limited, due to a high risk of contamination which restricts the capacity for reliable research. On top of that, this requires researchers to repeatedly perform data curation steps in order to evaluate their models on downstream tasks, based on previously unseen data. This process is not only time- and resource-intensive but also introduces potential inconsistencies across studies, which can impact their reproducibility.
We address these challenges through a comprehensive licensing analysis and by developing robust datasets to support accurate and reproducible large language model evaluations. We compiled a list of 53 large language models trained on file-level code and analyzed their datasets, discovering pervasive license inconsistencies despite careful selection based on repository licenses. Our analysis, covering 514M code files, reveals 38M exact duplicates of strong copyleft code, and 171M file-leading comments, 16M of which are under copyleft licenses and another 11M discouraging unauthorized copying. To further understand the depth of non-permissive code in public training datasets, we developed StackLessV2, a strong copyleft Java dataset decontaminated against The Stack V2 to facilitate accurate model evaluations. Our results revealed that non-permissive code is also present at the near-duplication level, although this represents a gray area in terms of legal interpretation, where the boundary between acceptable reuse and license violation is still unclear, emphasizing the need for further legal clarification. Finally, we extend this work and introduce The Heap, a large multilingual copyleft dataset covering 57 programming languages, specifically deduplicated to avoid contamination from existing open training datasets. The Heap offers a solution for conducting fair, reproducible evaluations of large language models without the significant overhead of the data curation process. ...
Master thesis (2025) - G.J.B. Vegelien, A. van Deursen, C.E. Brandt, Bas Graaf, A. Katsifodimos
In today’s rapidly evolving software landscape, where continuous integration and continuous delivery are paramount, the presence of flaky tests poses a significant obstacle. These tests, exhibiting unpredictable pass/fail behavior, hinder development progress, waste valuable resources, and erode developer trust. This research delves into the root causes and mitigation strategies for flaky tests within a large-scale, database-driven industrial setting: Exact.

The increasing reliance on databases in modern software systems, including Exact’s own platform, necessitates a deeper understanding of the unique challenges posed by database-dependent tests. By analyzing flaky test behavior through repeated test runs on the same code, we identified key contributors to flakiness, including resource contention, test order dependencies, ‘dirty tests’ that leave the system in an inconsistent state, platform-specific issues, and combinations thereof.

Based on the root causes for flakiness at Exact, we developed and evaluated three mitigation strategies and supporting tools: minimizing redundant database background tasks, explicitly disposing of test data, and disabling database dirty tests. Our study resulted in a substantial reduction in flakiness, leading to a significant increase in the release rate at Exact from 60% to 96%. We improved the chance of their CI/CD pipeline passing with no code changes from 27% to 95%.

Furthermore, this research highlights the importance of collecting and analyzing rich, granular test data to identify patterns and root causes of flakiness. Providing developers with actionable information from this analysis motivates them to address flakiness proactively. Moreover, understanding the interplay between different types of tests, such as the impact of dirty tests on other seemingly unrelated tests or in combination with other factors, is crucial for effectively mitigating cascading failures. ...
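The detection approach mentioned above, repeated test runs on the same code, can be illustrated with a short sketch. It assumes a test is flagged as flaky when it both passes and fails on an identical commit. The record layout and all names are hypothetical and do not reflect the tooling built at Exact.

```python
from collections import defaultdict
from typing import Iterable


def find_flaky_tests(runs: Iterable[tuple[str, str, bool]]) -> set[tuple[str, str]]:
    """Return (commit, test) pairs that both passed and failed on the same commit.

    Each run record is (commit_id, test_name, passed).
    """
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for commit_id, test_name, passed in runs:
        outcomes[(commit_id, test_name)].add(passed)
    # Mixed outcomes on unchanged code cannot be explained by a code change.
    return {key for key, seen in outcomes.items() if seen == {True, False}}


# Hypothetical run log: one flaky test, one stable pass, one consistent failure.
run_log = [
    ("abc123", "test_invoice_total", True),
    ("abc123", "test_invoice_total", False),
    ("abc123", "test_login", True),
    ("abc123", "test_login", True),
    ("def456", "test_export", False),
    ("def456", "test_export", False),
]
print(find_flaky_tests(run_log))
```

Keeping the commit in the key separates genuine regressions, which fail consistently, from nondeterministic behaviour. Finding root causes such as test order dependencies would require richer data than this sketch records, for example the execution order of each run.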
Doctoral thesis (2025) - S.A.M. Mir, A. van Deursen, S. Proksch
Software engineering, fundamental to modern technological advancement, profoundly influences various aspects of society by enhancing efficiency, accessibility, and security. This discipline involves systematically applying engineering principles to software systems' design, development, testing, and maintenance. Innovations in software engineering have revolutionized industries such as communication, finance, healthcare, and education, democratizing access to information and connecting global communities. As software systems become increasingly complex, the need for efficient, secure, and reliable software analysis tools becomes paramount.

The thesis focuses on improving the actionability and scalability of software analysis by integrating machine learning (ML) techniques. Traditional static analysis tools often struggle with large codebases, leading to high false positive rates and high computational costs. Machine learning, particularly deep learning architectures like Transformers, offers a promising solution by capturing long-range dependencies in code and learning hierarchical representations. This capability enables ML models to automate tasks such as bug detection, source code summarization, and program repair, providing developers with actionable insights and improving overall productivity and code quality.

A significant contribution of this thesis is the development of ML-based techniques for type inference in Python and call graph pruning. An ML-based type inference approach, namely Type4Py, was proposed, which accurately predicts type annotations for Python code, enhancing code quality and reducing runtime errors. ML models with conservative pruning strategies were proposed for call graph pruning, which learn from dynamic traces obtained by executing programs to identify and eliminate false edges, thereby minimizing false positives and improving precision. Additionally, the thesis explores the application of call graphs in vulnerability analysis, demonstrating that granular assessments provide more accurate and actionable insights than more straightforward, dependency-level analyses.

In summary, this thesis advances the field of software analysis by harnessing machine learning to address two important issues related to the actionability and scalability of software analysis tools. The proposed ML-driven tools and techniques enhance the precision and reliability of software analysis and support developers in maintaining robust, secure, and maintainable software systems. These contributions pave the way for future research in applying ML techniques to various aspects of software engineering, promising further improvements in software development practices. ...
Doctoral thesis (2025) - E. Kula, A. van Deursen, G. Gousios
Late deliveries have been a common problem in the software industry for decades. They often result from deficiencies in effort estimation and project planning. These deficiencies arise due to the complexity of software development, where various social and technical factors affect project effort and scheduling. Variability in human elements, such as team dynamics and changing user requirements, adds further uncertainty. Since meeting time and cost estimates is crucial for project success, improving effort estimation and planning remains a key priority for software organizations. More accurate forecasting enables better resource allocation, reduces delays, and enhances customer satisfaction.

Over the past two decades, software organizations have increasingly adopted agile methods to improve flexibility and responsiveness. However, despite these advantages, schedule delays remain common, with nearly half of agile projects experiencing overruns of 25% or more. A key challenge lies in balancing the flexible, short-term planning of small functionalities (user stories) with the structured, long-term planning required for larger development units (epics). Current industry practices offer limited support for managing these complexities, especially in large-scale agile settings.

This thesis presents a novel suite of expert- and data-based strategies to improve effort estimation and planning in large-scale agile software development. We conduct a series of case studies at ING, a large Dutch internationally operating bank, to collect and analyze data from hundreds of agile teams and projects. We identify key factors influencing delays in epics and user stories and develop models to predict delays at both levels. At the epic level, we compile our findings into a conceptual framework representing influential factors and their relationships to on-time delivery. Additionally, we explore dynamic Bayesian methods to continuously update delay predictions throughout an epic's development life cycle. At the story level, we examine how team characteristics affect the likelihood of delays. We also investigate how these factors, combined with incremental learning methods, can improve story delay predictions. Finally, we develop a model that optimizes sprint plans based on team goals and delivery performance.

Our research identifies 25 factors and their interactions that affect the on-time delivery of epics. The most influential factors are predominantly social in nature, such as task dependencies, organizational alignment, and internal politics. These factors interact hierarchically: organizational factors shape team behavior, which in turn affects technical factors. To capture these complexities, we demonstrate that dynamic Bayesian methods, using delay patterns as input, effectively update delay predictions as new information becomes available. At the story level, our findings suggest that planning in agile settings can be significantly improved by integrating team-related information and incremental learning methods into predictive models. Moreover, we find that user story prioritization depends on a combination of factors that vary by project context. Our sprint plan optimization model effectively addresses this variability and generates plans that deliver more business value, align more closely with sprint goals, and mitigate delay risks better. ...

Enhancing consumer-facing code completion with low-cost general enhancements

Master thesis (2024) - T.O. van Dam, M. Izadi, A. van Deursen, Egor Bogomolov, J. Yang
Master thesis (2024) - P.M. de Bekker, M. Izadi, A. van Deursen, M.S. Pera
Artificial Intelligence (AI) has rapidly advanced, significantly impacting software engineering through AI-driven tools like ChatGPT and Copilot. These tools, which have garnered substantial commercial interest, rely heavily on the performance of their underlying models, assessed via benchmarks. However, the current focus on performance scores has often overshadowed the quality and rigor of these benchmarks, as emphasized by the absence of studies on this topic. This thesis addresses this gap by reviewing and improving benchmarking practices in the field of AI for software engineering (AI4SE).

First, a categorized overview and analysis of nearly a hundred prominent AI4SE benchmarks from the past decade are provided. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Lastly, a significant contribution of this work is the introduction of HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings demonstrate substantial drops in pass@1 scores for various large language models, highlighting the necessity for well-maintained and comprehensive benchmarks.

This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field. ...
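The pass@1 scores mentioned above are commonly computed with the unbiased pass@k estimator introduced alongside the original HumanEval benchmark. The sketch below assumes that estimator; the abstract does not state which variant the thesis uses, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled solutions, c of which pass all tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 6 pass the original tests,
# but only 4 still pass once stricter tests and edge cases are added.
print(pass_at_k(n=10, c=6, k=1))
print(pass_at_k(n=10, c=4, k=1))
```

For k = 1 the estimate reduces to c / n, so every sampled solution that stops passing under a stricter test suite lowers the score directly.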

Building and evaluating an LLM-based code completion plugin for JetBrains IDEs

Master thesis (2024) - F.N.M. van der Heijden, A. van Deursen, M. Izadi, U.K. Gadiraju, S. Titov, A. Sergeyuk