A. van Deursen | TU Delft Repository

Monitoring and Maintaining Machine Learning Models Against Concept Drift in the Context of AIOps Systems

Doctoral thesis (2026) - L. Poenaru-Olaru, A. van Deursen, J.S. Rellermeyer, L. Miranda da Cruz

The adoption of AI systems across various sectors has increased considerably in recent years. This is a consequence of the remarkable capability of AI to extract insights from large-scale datasets, improve personalization, automate tasks and complex processes within organizations ...

Counterfactual Explanations and Algorithmic Recourse for Trustworthy AI

Doctoral thesis (2026) - P. Altmeyer, C.C.S. Liem, A. van Deursen

Many of the most celebrated recent advances in artificial intelligence (AI) have been built on the back of highly complex and opaque models that need little human oversight to achieve strong predictive performance. But while their capacity to recognize patterns from raw data is impressive, their decision-making process is neither robust nor well understood. This has so far inhibited trust and widespread adoption of these technologies. This thesis contributes to research efforts aimed at tackling these challenges, through interdisciplinary insights and methodological contributions.

The principal goal of this work is to contribute methods that help make opaque AI models more trustworthy. Specifically, we aim to (1) explore and challenge existing technologies and paradigms in the field; (2) improve our ability to hold opaque models accountable through thorough scrutiny; and (3) leverage the results of such scrutiny during training to improve the trustworthiness of models. Methodologically, the thesis focuses on counterfactual explanations and algorithmic recourse for individuals subjected to opaque AI systems. We explore what type of real-world dynamics can be expected to play out when recourse is provided and implemented in practice. Based on our finding that individual cost minimization, a core objective in recourse, neglects hidden external costs of recourse itself, we revisit yet another established objective: namely, that explanations should be plausible first and foremost. Our work demonstrates that a narrow focus on this objective can mislead us into trusting fundamentally untrustworthy systems. To avoid this scenario, we propose a novel method that aids us in disclosing explanations that are maximally faithful, that is, consistent with the behavior of models. This not only allows us to assess the trustworthiness of models, but also to improve it: we show that faithful explanations can be used during training to ensure that models learn plausible explanations.

Finally, we also critically assess efforts towards trustworthy AI in the context of modern large language models (LLMs). Specifically, we cast doubt on recent findings and practices presented in the field of mechanistic interpretability and caution our fellow researchers in this space against misinterpreting and inflating their findings.

In summary, this thesis makes cutting-edge research contributions that improve our ability to make opaque AI models more trustworthy. Beyond our core research contributions, this thesis makes substantial contributions to open-source software. Through various software packages that we have developed, we make our research and that of others more accessible.

The Status of JavaScript Test Generation: A Benchmark-Based Evaluation

Master thesis (2025) - S.R. Sunnevudóttir, A. van Deursen, M.J.G. Olsthoorn, Pouria Derakhshanfar, M.A. Costea

Automated test generation is a critical area of research in software engineering, aiming to reduce manual effort while improving software reliability. While substantial work has focused on statically typed languages, dynamically typed languages such as JavaScript remain underexpl ...

Analyzing the Impact of Self-Admitted Technical Debt on the Code Completion Performance of Large Language Models

Bachelor thesis (2025) - L.C. Witte, A. van Deursen, M. Izadi, J.B. Katzy, R.M. Popescu, A. Anand

Large Language Models (LLMs) are increasingly integrated into development workflows for tasks such as code completion, bug fixing, and refactoring. While prior work has shown that removing low-quality data—including data smells like Self-Admitted Technical Debt (SATD)—from traini ...

Data Hound: Linking Educational Value to LLM Code Completion Performance During Inference

Bachelor thesis (2025) - B.R.M. Annink, A. van Deursen, M. Izadi, J.B. Katzy, R.M. Popescu, A. Anand

This paper investigates the relation between the educational value of input code and the subsequent inference performance of code large language models (LLMs) on completion tasks. Results were attained using The Heap dataset and using SmolLM2, StarCoder 2 and Mellum models. Perfo ...

Data Hound: Analyzing Boilerplate Code Data Smell on Large Code Datasets

Bachelor thesis (2025) - S.A. Minkov, A. van Deursen, M. Izadi, J.B. Katzy, R.M. Popescu

As Large Language Models become an ever more integral part of Software Engineering, often assisting developers on coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to have an impact on a Large Language Mod ...

Evaluating the Impact of Gate Errors on a Quantum-Aided Byzantine Agreement Protocol

Bachelor thesis (2025) - J.K. Wierzbicki, T.J. Coopmans, A. van Deursen

The Byzantine agreement problem in computer science focuses on honest parties trying to achieve consensus in a network with malicious actors. The performance of a quantum-aided Byzantine agreement protocol was evaluated under more realistic noise conditions, with a particular foc ...

LLM-Driven Synthesis of Concurrent Data Structures with SMR under Weak Memory

Master thesis (2025) - A. Dumitriu, A. van Deursen, S.S. Chakraborty

Developing correct concurrent data structures under weak memory models presents significant challenges due to subtle concurrency errors arising from relaxed ordering guarantees and complexities in Safe Memory Reclamation. Existing synthesis methods largely assume sequential consi ...

AI-Powered Delay Prediction for Portfolio Management

Master thesis (2025) - G.C. dos Santos Rocha, A. van Deursen, D.M. van Solingen, U.K. Gadiraju

Delayed software projects are one of the biggest threats to the integrity of many project portfolios. If portfolio managers were able to foresee delays, they could better manage risks, make adjustments to the planning and reduce delay propagation. In their 2023 paper "Dynamic Pre ...

Reducing Carbon Emissions of Code Generation in Large Language Models with Line-level Completions

Master thesis (2025) - T.J. Nulle, A. van Deursen, L. Cruz, J. Yang

This thesis investigates reducing carbon emissions in code generation using large language models (LLMs) by comparing function-level and line-level code completions across models of different sizes (1.5B and 9B parameters). The study utilises the BigCodeBench dataset, comprising ...

Dataset Development for LLMs4Code: Licensing, Contamination, and Reproducibility Challenges

Master thesis (2025) - R. Popescu, A. van Deursen, M. Izadi, J. Yang

The rapid rise in the popularity of large language models has highlighted the need for extensive datasets, especially for training on code. However, this growth has also raised important questions about the legal implications of using code in large language model training, particularly regarding the potential infringement of code licenses. At the same time, the availability of clean datasets for evaluating these models is becoming increasingly limited, due to a high risk of contamination which restricts the capacity for reliable research. On top of that, this requires researchers to repeatedly perform data curation steps in order to evaluate their models on downstream tasks, based on previously unseen data. This process is not only time- and resource-intensive but also introduces potential inconsistencies across studies, which can impact their reproducibility.
We address these challenges through a comprehensive licensing analysis and by developing robust datasets to support accurate and reproducible large language model evaluations. We compiled a list of 53 large language models trained on file-level code and analyzed their datasets, discovering pervasive license inconsistencies despite careful selection based on repository licenses. Our analysis, covering 514M code files, reveals 38M exact duplicates of strong copyleft code, and 171M file-leading comments, 16M of which are under copyleft licenses and another 11M discouraging unauthorized copying. To further understand the depth of non-permissive code in public training datasets, we developed StackLessV2, a strong copyleft Java dataset decontaminated against The Stack V2 to facilitate accurate model evaluations. Our results revealed that non-permissive code is also present at the near-duplication level. This, however, represents a gray area in terms of legal interpretation, where the boundary between acceptable reuse and license violation is still unclear, emphasizing the need for further legal clarification. Finally, we build on this work and introduce The Heap, a large multilingual copyleft dataset covering 57 programming languages, specifically deduplicated to avoid contamination from existing open training datasets. The Heap offers a solution for conducting fair, reproducible evaluations of large language models without the significant overhead of the data curation process.
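
As an illustration of the exact-duplicate decontamination step described above, the following is a minimal sketch, not the pipeline used in the thesis. It assumes that "exact duplicate" means identical file content after a light whitespace normalization; the function names (normalize, content_hash, decontaminate) and the normalization rule are hypothetical choices made for this example.

```python
import hashlib


def normalize(source: str) -> str:
    # Hypothetical normalization (an assumption, not taken from the thesis):
    # drop surrounding blank space and trailing whitespace on each line.
    return "\n".join(line.rstrip() for line in source.strip().splitlines())


def content_hash(source: str) -> str:
    # SHA-256 digest of the normalized file content.
    return hashlib.sha256(normalize(source).encode("utf-8")).hexdigest()


def decontaminate(candidates: dict[str, str], reference: list[str]) -> dict[str, str]:
    """Keep only candidate files whose content is absent from the reference
    corpus and not already kept (exact-match deduplication)."""
    seen = {content_hash(src) for src in reference}
    kept: dict[str, str] = {}
    for path, src in candidates.items():
        digest = content_hash(src)
        if digest in seen:
            continue  # duplicate of a reference file or of an earlier candidate
        seen.add(digest)
        kept[path] = src
    return kept


if __name__ == "__main__":
    reference_corpus = ["int x = 1;  \n"]
    candidate_files = {
        "a.java": "int x = 1;\n",    # matches the reference after normalization
        "b.java": "int y = 2;\n",    # new content, kept
        "c.java": "int y = 2;  \n",  # duplicate of b.java, dropped
    }
    print(sorted(decontaminate(candidate_files, reference_corpus)))
```

Exact hashing only catches identical files; the near-duplicates mentioned in the abstract would need a similarity-based method, which this sketch does not attempt.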

Addressing Test Flakiness: Practical Approaches in a Database-Reliant Industrial System

Flaky Tests at Exact

Master thesis (2025) - G.J.B. Vegelien, A. van Deursen, C.E. Brandt, Bas Graaf, A. Katsifodimos

In today’s rapidly evolving software landscape, where continuous integration and continuous delivery are paramount, the presence of flaky tests poses a significant obstacle. These tests, exhibiting unpredictable pass/fail behavior, hinder development progress, waste valuable reso ...

Programming and executing applications on quantum network nodes

Doctoral thesis (2025) - B. van der Vecht, S.D.C. Wehner, A. van Deursen

Modeling Effort Estimation and Planning in Large-Scale Agile Software Development

Doctoral thesis (2025) - E. Kula, A. van Deursen, G. Gousios

Late deliveries have been a common problem in the software industry for decades. They often result from deficiencies in effort estimation and project planning. These deficiencies arise due to the complexity of software development, where various social and technical factors affect project effort and scheduling. Variability in human elements, such as team dynamics and changing user requirements, adds further uncertainty. Since meeting time and cost estimates is crucial for project success, improving effort estimation and planning remains a key priority for software organizations. More accurate forecasting enables better resource allocation, reduces delays, and enhances customer satisfaction.

Over the past two decades, software organizations have increasingly adopted agile methods to improve flexibility and responsiveness. However, despite these advantages, schedule delays remain common, with nearly half of agile projects experiencing overruns of 25% or more. A key challenge lies in balancing the flexible, short-term planning of small functionalities (user stories) with the structured, long-term planning required for larger development units (epics). Current industry practices offer limited support for managing these complexities, especially in large-scale agile settings.

This thesis presents a novel suite of expert- and data-based strategies to improve effort estimation and planning in large-scale agile software development. We conduct a series of case studies at ING, a large Dutch internationally operating bank, to collect and analyze data from hundreds of agile teams and projects. We identify key factors influencing delays in epics and user stories and develop models to predict delays at both levels. At the epic level, we compile our findings into a conceptual framework representing influential factors and their relationships to on-time delivery. Additionally, we explore dynamic Bayesian methods to continuously update delay predictions throughout an epic's development life cycle. At the story level, we examine how team characteristics affect the likelihood of delays. We also investigate how these factors, combined with incremental learning methods, can improve story delay predictions. Finally, we develop a model that optimizes sprint plans based on team goals and delivery performance.

Our research identifies 25 factors and their interactions that affect the on-time delivery of epics. The most influential factors are predominantly social in nature, such as task dependencies, organizational alignment, and internal politics. These factors interact hierarchically: organizational factors shape team behavior, which in turn affects technical factors. To capture these complexities, we demonstrate that dynamic Bayesian methods, using delay patterns as input, effectively update delay predictions as new information becomes available. At the story level, our findings suggest that planning in agile settings can be significantly improved by integrating team-related information and incremental learning methods into predictive models. Moreover, we find that user story prioritization depends on a combination of factors that vary by project context. Our sprint plan optimization model effectively addresses this variability and generates plans that deliver more business value, align more closely with sprint goals, and mitigate delay risks better.

Machine Learning-assisted Software Analysis

Doctoral thesis (2025) - S.A.M. Mir, A. van Deursen, S. Proksch

Software engineering, fundamental to modern technological advancement, profoundly influences various aspects of society by enhancing efficiency, accessibility, and security. This discipline involves systematically applying engineering principles to software systems' design, development, testing, and maintenance. Innovations in software engineering have revolutionized industries such as communication, finance, healthcare, and education, democratizing access to information and connecting global communities. As software systems become increasingly complex, the need for efficient, secure, and reliable software analysis tools becomes paramount.

The thesis focuses on improving the actionability and scalability of software analysis by integrating machine learning (ML) techniques. Traditional static analysis tools often struggle with large codebases, leading to high false positive rates and high computational costs. Machine learning, particularly deep learning architectures like Transformers, offers a promising solution by capturing long-range dependencies in code and learning hierarchical representations. This capability enables ML models to automate tasks such as bug detection, source code summarization, and program repair, providing developers with actionable insights and improving overall productivity and code quality.

A significant contribution of this thesis is the development of ML-based techniques for type inference in Python and for call graph pruning. An ML-based type inference approach, namely Type4Py, was proposed, which accurately predicts type annotations for Python code, enhancing code quality and reducing runtime errors. For call graph pruning, ML models with conservative pruning strategies were proposed; these models learn from dynamic traces obtained by executing programs to identify and eliminate false edges, thereby minimizing false positives and improving precision. Additionally, the thesis explores the application of call graphs in vulnerability analysis, demonstrating that granular assessments provide more accurate and actionable insights than simpler, dependency-level analyses.
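
To make the idea of conservative call graph pruning concrete, here is a minimal sketch under stated assumptions; it is not the thesis's model. It assumes each statically derived edge receives a score estimating the probability that the call really occurs, and that "conservative" means removing an edge only when that score falls below a low threshold. The scorer score_edge, the threshold value, and the example edges are all hypothetical stand-ins for a model trained on dynamic traces.

```python
from typing import Callable

Edge = tuple[str, str]  # (caller, callee)


def prune_call_graph(
    edges: list[Edge],
    score_edge: Callable[[Edge], float],
    keep_threshold: float = 0.05,
) -> tuple[list[Edge], list[Edge]]:
    """Split edges into (kept, pruned).

    score_edge is a hypothetical stand-in for a trained classifier that
    returns the estimated probability that an edge is a real call.
    An edge is pruned only if its score is below keep_threshold, so
    uncertain edges are retained (the conservative choice).
    """
    kept: list[Edge] = []
    pruned: list[Edge] = []
    for edge in edges:
        if score_edge(edge) < keep_threshold:
            pruned.append(edge)
        else:
            kept.append(edge)
    return kept, pruned


if __name__ == "__main__":
    # Made-up scores for illustration only.
    scores = {
        ("main", "parse"): 0.97,
        ("parse", "log_debug"): 0.40,
        ("parse", "legacy_hook"): 0.01,
    }
    kept, pruned = prune_call_graph(list(scores), scores.__getitem__)
    print("kept:", kept)
    print("pruned:", pruned)
```

A low threshold trades precision for recall: fewer false edges are removed, but true edges are less likely to be lost, which matters when the pruned graph feeds a downstream analysis such as vulnerability reachability.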

In summary, this thesis advances the field of software analysis by harnessing machine learning to address two important issues related to the actionability and scalability of software analysis tools. The proposed ML-driven tools and techniques enhance the precision and reliability of software analysis and support developers in maintaining robust, secure, and maintainable software systems. These contributions pave the way for future research in applying ML techniques to various aspects of software engineering, promising further improvements in software development practices.

Black-box context-aware code completion

Enhancing consumer-facing code completion with low-cost general enhancements

Master thesis (2024) - T.O. van Dam, M. Izadi, A. van Deursen, Egor Bogomolov, J. Yang

AI for Software Engineering: Reviewing and Improving Benchmarking Practices

Master thesis (2024) - P.M. de Bekker, M. Izadi, A. van Deursen, M.S. Pera

Artificial Intelligence (AI) has rapidly advanced, significantly impacting software engineering through AI-driven tools like ChatGPT and Copilot. These tools, which have garnered substantial commercial interest, rely heavily on the performance of their underlying models, assessed ...

Interactive & Adaptive LLMs

Building and evaluating an LLM-based code completion plugin for JetBrains IDEs

Master thesis (2024) - F.N.M. van der Heijden, A. van Deursen, M. Izadi, U.K. Gadiraju, S. Titov, A. Sergeyuk

Implications of LLMs4Code on Copyright Infringement

An Exploratory Study Through Red Teaming

Bachelor thesis (2024) - B. Koc, A. Al-Kaswan, M. Izadi, A. van Deursen, K. Liang

Large Language Models (LLMs) have experienced a rapid increase in usage across numerous sectors in recent years. However, this growth brings a greater risk of misuse. This paper explores the issue of copyright infringement facilitated by LLMs in the domain of software engineering ...

Red-Teaming Code LLMs for Malware Generation

Bachelor thesis (2024) - C. Ionescu, A. van Deursen, M. Izadi, A. Al-Kaswan, K. Liang

Large Language Models (LLMs) are increasingly used in software development, but their potential for misuse in generating harmful code, such as malware, raises significant concerns. We present a red-teaming approach to assess the safety and ethical alignment of LLMs in the context ...