M. Izadi

45 records found

Refactoring is a critical part of the software development lifecycle, and identifier renaming accounts for roughly 15% of all agentic refactoring work driven by large language models. Yet the dominant model families fit the task poorly. Autoregressive decoders generate left to right, and even with the fill-in-the-middle extension they resolve masked positions one at a time, so a renaming decision at one site cannot inform a decision at another. Identifier renaming, however, demands consistency across every affected site at once. Diffusion Large Language Models (dLLMs) generate by iteratively denoising a masked sequence under full bidirectional attention, with every prediction conditioned on every other. This matches what renaming needs: if a poorly named identifier is viewed as a small amount of semantic noise overlaid on correct code, then renaming becomes a targeted denoising task that can be solved jointly across all affected sites.

We instantiate this view as CoReFusion, the first systematic study of dLLMs on Java identifier renaming, and benchmark them against twelve decoder-only FIM-AR baselines and five encoder-decoder Seq2Seq baselines on the RefineID dataset. DreamCoder-7B and DiffuCoder-7B reach 33.2% and 31.1% Exact Match, beating the best non-dLLM model (CodeT5-large) by more than ten points while being roughly nine times smaller than the largest FIM-AR baseline. The advantage grows with the number of identifiers that must be renamed together: FIM-AR models win the single-site case, but dLLMs pull ahead as soon as the task involves more than one site. When the same dLLMs must instead find the positions on their own, Exact Match drops to about 3%, and most wrong predictions copy the lexical style of the surrounding code rather than improve on it. Probing the internal states of DiffuCoder-7B shows why: the signal that tells a bad name from a good one appears only in the last few layers and the last few denoising steps, after the unmasking schedule has already confirmed most of its predictions. Providing the rename positions as masks bypasses this timing problem, which is why dLLMs work as filling engines but not as standalone refactoring agents. ...
Master thesis (2026) - K. Hoxha, M. Izadi, Oguzhan Yildiz, B. Özkan, P.K. Murukannaiah
Repository-level code generation remains difficult in industrial systems because tasks span multiple files, internal APIs, architectural conventions, tests, and quality constraints. We present CoCA (Copilot-Orchestrated Contextual Agents), an IDE-constrained framework currently instantiated for Java repositories that extends GitHub Copilot Chat with task decomposition, deterministic repository-context retrieval, optional Test-Driven Generation, and persistent domain-context injection for enterprise settings where external embeddings, fine-tuning, and third-party LLM services are not permitted.

We evaluate CoCA at ASML using CoCABench, an internal suite with a long-horizon task focus composed of 5 epics from 2 proprietary Java repositories with 44 developer-identified subtasks, ranging from a 2-day bug fix to 3-month feature work. Full CoCA is associated with higher ground-truth alignment than the single-agent baseline, from 0.25 to 0.44, on the LLM-judge metric with the strongest inter-rater reliability (Krippendorff's α=0.46). However, it achieves only 0.20 pass@1 despite 0.60 build@1, while the single-agent baseline achieves the highest pass@1.

These research findings suggest that IDE-constrained agentic workflows can move generated implementations closer to the intended developer solution, but do not yet solve reliable executable integration. CoCA is therefore best understood as a developer-in-the-loop assistance workflow rather than a fully autonomous implementation system or a replacement for direct Copilot prompting. It appears most appropriate for long, integration-heavy feature epics where planning, context continuity, and repository awareness are valuable. For small localized fixes, the orchestration overhead may outweigh these gains. ...
Master thesis (2026) - I. Joshi, M. Izadi, R.M. Popescu, B. Özkan, M.A. Migut
The rapid adoption of autonomous coding agents raises a practical question for developers: is agent-authored code maintainable after merge? We present a large-scale empirical study of agent- and human-authored pull requests in open-source GitHub repositories, focusing on refactoring and maintainability. We construct a novel dataset of 4,392,818 agent-authored and 517,880 human-authored pull requests from 863,819 repositories, spanning 10 agents and 4 programming languages: C++, Java, JavaScript, and Python. Using a subset of 321,986 pull requests, we compare refactoring behavior, code smells, and maintainability metrics between agent- and human-authored contributions. We further examine how these outcomes vary across languages, repository popularity, and domains, and track post-merge evolution from 3 days to 2 months after merge to assess whether maintainability-related effects persist over time.

Our results show that agent-authored pull requests refactor less frequently and less diversely than human-authored pull requests, but their refactorings tend to affect larger code regions, especially in less popular repositories. Maintainability outcomes are mixed: agent-modified code is more likely to contain code smells after merge, while median metric changes remain context-dependent and broadly comparable to human-authored code. Longitudinally, agent-modified code shows similar maintainability trends after the early post-merge period, although agent-modified regions are revisited more frequently. ...
Large Language Models (LLMs) for code are trained on large amounts of data that may contain copyrighted and licensed content, which motivates internal auditing methods that can test whether specific data points were included during training. In this work we conduct an exploratory evaluation of membership inference attacks (MIAs) as auditing signals for code-specialized LLMs. We compare a loss-based baseline to Polarized Augment Calibration (PAC) across three open models in the 3B to 4B range (Mellum-4B, StarCoder2-3B, and SmolLM3-3B) using the Java subset of a contamination-controlled evaluation dataset. We find that PAC provides consistent improvements over the loss signal on the code models, while near-member samples are detected almost as effectively as exact members. A stratified analysis shows that attack performance varies substantially with file properties, with strongest separability on small-to-medium files and on code with higher alphanumeric content, and degradation on very large files. Motivated by the syntactic fragility of token-swap augmentation on code, we propose PAC-AST, an AST-guided augmentation scheme that generates syntactically valid neighbors. PAC-AST exhibits improved behavior on larger and syntactically complex files where token-swap PAC degrades, but underperforms in smaller and alphanumeric-rich strata, due in part to a reduced effective mutation magnitude. Overall, the results indicate that (i) calibration-based signals can strengthen grey-box auditing for code models, (ii) dataset and program characteristics are major drivers of measured leakage, and (iii) code-specific augmentation is a promising direction but requires controlling perturbation magnitude and neighbor quality to yield stable gains.

https://zenodo.org/records/18367988
https://doi.org/10.5281/zenodo.18367987
...
Code language models are pretrained on massive datasets scraped from public repositories, which are rarely disclosed. Membership Inference Attacks (MIAs) aim to predict whether specific samples were used in training, but attack performance is contested. Previous work has shown that many attacks on LLMs perform randomly when evaluated on independent and identically distributed (i.i.d.) members and non-members. We consider three MIAs: LOSS, MinK%, and SURP (where each attack extends the last with additional filtering of tokens considered for the membership signal), on StarCoder2-3B and Mellum-4B using the AISE MIA dataset, which contains 100,000 Java files with verified membership labels. We address a gap in the evaluation of these attacks on i.i.d. code samples and in the detailed comparison of SURP and MinK%. A bag-of-words (BoW) classifier is used to measure distribution shift, with an expected ROC-AUC of 0.5 under i.i.d. conditions. We achieve a ROC-AUC of 0.91, confirming substantial distribution shift. We apply two debiasing procedures to construct evaluation subsets: taking samples close to the BoW decision boundary reduces BoW ROC-AUC performance to 0.66, while selecting BoW misclassified samples fails to reduce shift. After debiasing, all attacks perform at or below the bag-of-words baseline, with ROC-AUC between 0.55 and 0.63 and TPR at 5% FPR between 0.05 and 0.16, suggesting random performance under strict i.i.d. conditions. Hyperparameter ablation reveals that SURP collapses to MinK% under optimization: optimal configurations disable SURP filtering or have classification agreement exceeding 94%, excluding one outlier. These results extend prior natural language findings to code: reference-free attacks exploit distributional differences rather than detecting membership. ...
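The two evaluation metrics quoted above, ROC-AUC and TPR at a fixed FPR, can be illustrated with a minimal sketch. It assumes that a higher attack score means "more likely a member"; the function names and toy scores are hypothetical and are not taken from the study.

```python
def roc_auc(member_scores, nonmember_scores):
    """ROC-AUC as the probability that a random member outscores a
    random non-member (ties count as half a win)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """True-positive rate at the strictest threshold whose false-positive
    rate does not exceed max_fpr. A sample is flagged when score > threshold."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # non-members we may wrongly flag
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for m in member_scores if m > threshold)
    return flagged / len(member_scores)


# Toy scores, purely illustrative.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.5, 0.2, 0.1]
auc = roc_auc(members, nonmembers)
tpr = tpr_at_fpr(members, nonmembers, max_fpr=0.25)
```

An attack that cannot separate the two groups yields a ROC-AUC near 0.5, which is why that value serves as the i.i.d. reference point for the bag-of-words classifier.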

An Evaluation of the Min-K% Prob membership inference attack

Large Language Models are becoming increasingly popular in software engineering, yet the exact composition of their training data remains largely undisclosed. This opacity introduces risks regarding copyright infringement and benchmark contamination. In this work, we audit the susceptibility of different models (StarCoder2, Mellum, and SmolLM3) to Membership Inference Attacks on code files, specifically evaluating the Min-K% Prob method.

We find that this approach serves as an effective auditor, achieving ROC-AUC scores of up to 0.793, yet performance degrades as non-members become more similar to members. The classification is primarily driven by non-functional artifacts, such as license headers and package identifiers.

Furthermore, we investigate post-training quantization as an attack accelerator. We find that the membership signal remains robust even when weights are compressed from 32-bit to 4-bit precision, and that the 16-bit Brain Float (BF16) format reduces inference latency by a factor of 6, establishing Min-K% Prob (MKP) as a practical tool for assessing membership in models' training sets. ...
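As a rough illustration of the Min-K% Prob signal, the sketch below scores a file from its per-token log-probabilities: it averages only the k% least likely tokens, on the intuition that a file seen in training tends to have fewer very surprising tokens. The helper names, the default k, and the toy log-probabilities are assumptions for illustration; obtaining the log-probabilities from an actual model is out of scope here.

```python
import math


def mean_logprob_score(token_logprobs):
    """Loss-style baseline: mean log-probability over all tokens."""
    return sum(token_logprobs) / len(token_logprobs)


def min_k_prob_score(token_logprobs, k=0.2):
    """Min-K% Prob: mean log-probability of the k fraction of tokens
    with the lowest log-probability. Higher means more member-like."""
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for one file.
logprobs = [-0.1, -0.2, -3.0, -0.05, -5.0]
score = min_k_prob_score(logprobs, k=0.4)  # averages -5.0 and -3.0
```

With k set to 1.0 the score reduces to the loss-style baseline, which is one way to see Min-K% Prob as the same signal with additional token filtering.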
Master thesis (2026) - A. Ţerna, A. van Deursen, M. Izadi, J. Yang, Timur Galimzyanov, Sergey Titov
Automated program repair (APR) is increasingly critical in modern software development, yet language models (LMs) often struggle to capture repository-specific conventions and constraints. Small language models (SLMs) offer a cost-effective and deployable alternative, but their performance depends heavily on high-quality domain-specific supervision. In this work, we introduce a multi-teacher distillation pipeline that generates multi-turn repair trajectories, including both successful fixes and intermediate failures, to construct rich training datasets for method-level APR. We systematically analyze the impact of dataset size, repair diversity, fine-tuning strategies, hyperparameters, and reasoning supervision, aiming to identify efficient and reliable approaches for adapting SLMs to repository-specific repair tasks.

Our experiments demonstrate that parameter-efficient fine-tuning, particularly LoRA with carefully selected adapter ranks, achieves strong performance across reasoning and non-reasoning regimes while maintaining low computational cost. Explicit reasoning supervision is not required for high repair accuracy, but it significantly reduces reasoning trace lengths and inference costs. Dataset diversity and multi-turn trajectories are key to improving generalization and bridging the gap between reasoning and non-reasoning inference. Finally, this study seeks to provide empirical insights into the practical adaptation of SLMs for repository-specific APR, evaluating how strategic choices in dataset design, lightweight fine-tuning approaches, and reasoning supervision influence performance in real-world contexts. ...
Master thesis (2025) - V.A. Pocheva, N. Yorke-Smith, M. Izadi, René van den Berg, M.A. Costea, D. Spinellis
In large-scale engineering environments, efficient issue tracking is essential for timely problem resolution and knowledge reuse. However, manual classification and association of issue reports present scalability challenges, further complicated by inconsistent annotations and the absence of semantic linking mechanisms. This project investigates the application of Natural Language Processing and Artificial Intelligence to automate multi-label classification and discover meaningful semantic associations between technical issues. Over 70 model configurations were evaluated on a real-world industrial dataset, comparing classical models with transformer-based and deep learning approaches. DistilBERT achieved the highest Recall@5 (0.93), indicating strong performance in identifying relevant categories. Classical methods, such as TF-IDF combined with Logistic Regression, also performed well, offering a computationally efficient and interpretable option. For association discovery, approaches including lexical retrieval, embedding-based similarity, clustering-based filtering, and topic modelling were assessed using both quantitative metrics and expert review. Lexical (BM25) and embedding-based (SBERT + Cosine Similarity) methods offer complementary strengths, retrieving overlapping yet distinct sets of associations. Associations identified by both models were rated as useful in over 70% of cases by domain experts, suggesting that agreement between methods may serve as an indicator of relevance. While Copilot provided consistent relevance assessments, its ratings were often higher than those provided by human evaluators and did not always reflect their detailed assessments. These findings highlight the potential of combining lexical and semantic methods with human-in-the-loop validation to support scalable and accurate industrial applicability. ...
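The observation that agreement between the lexical and embedding-based retrievers may indicate relevance can be sketched as a simple intersection of their top-k results. The scores, issue identifiers, and function names below are hypothetical; in practice the scores would come from BM25 and from cosine similarity over SBERT embeddings.

```python
def top_k(scores, k):
    """Return the k highest-scoring issue ids, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [issue_id for issue_id, _ in ranked[:k]]


def agreed_associations(lexical_scores, embedding_scores, k=3):
    """Issues that appear in the top-k of both retrievers,
    kept in the order of the lexical ranking."""
    embedding_top = set(top_k(embedding_scores, k))
    return [i for i in top_k(lexical_scores, k) if i in embedding_top]


# Hypothetical similarity scores of candidate issues against one query issue.
bm25 = {"ISSUE-1": 7.2, "ISSUE-2": 5.1, "ISSUE-3": 4.8, "ISSUE-4": 0.3}
sbert = {"ISSUE-3": 0.91, "ISSUE-5": 0.88, "ISSUE-1": 0.80, "ISSUE-2": 0.10}
agreed = agreed_associations(bm25, sbert, k=3)
```

Candidates returned by only one retriever are not discarded in this sketch; they could still be passed to a human reviewer at lower priority.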

A study conducted at the ASML leveling department

Master thesis (2025) - Y. Mundhra, M. Izadi, F.A. Kuipers, Max Valk, Lewis Binns, U.K. Gadiraju, Goran Brkic
Large Language Models (LLMs) have shown impressive performance in various domains, including software engineering. Code generation, a crucial aspect of software development, has seen significant improvements with the integration of AI tools. While existing LLMs have shown very good performance in generating code for everyday tasks, their application in industrial settings and domain-specific contexts remains largely unexplored. This thesis investigates the potential of LLMs to generate code in proprietary, domain-specific environments, with a specific focus on the leveling department at ASML. The primary goal of this research is to assess the ability of LLMs to adapt to a domain they have not encountered before and to generate complex, interdependent code in a domain-specific repository. This involves evaluating the performance of LLMs in generating code that meets the specific requirements of ASML. To achieve this, the thesis investigates various prompting techniques, compares the performance of generic and code-specific LLMs, and examines the impact of model size on code generation capabilities. To evaluate the code generation capabilities of LLMs in repository-level scenarios, we introduce a new performance metric, build@k, designed to measure the effectiveness of generated code in compiling and building projects. The results showed that both prompting techniques and model size have a substantial influence on the code generation capabilities of LLMs. However, the performance difference between code-specific and generic LLMs was less pronounced and varied substantially across different model families. ...
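The abstract does not give the formula for build@k. One plausible reading, stated here as an assumption, is that it mirrors the widely used unbiased pass@k estimator, with "passes the tests" replaced by "compiles and builds". The sketch below implements that estimator; the function name and the example counts are hypothetical.

```python
from math import comb


def success_at_k(n, c, k):
    """Unbiased estimate of the probability that at least one of k samples
    succeeds, given that c of n generated samples succeeded.
    For build@k, 'success' is assumed to mean the project builds."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 10 generations, 3 of which build successfully.
build_at_1 = success_at_k(n=10, c=3, k=1)
build_at_5 = success_at_k(n=10, c=3, k=5)
```

For k = 1 the estimate is simply the fraction of generations that succeed, which makes build@1 and pass@1 directly comparable on the same set of samples.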
This paper investigates the relation between the educational value of input code and the subsequent inference performance of code large language models (LLMs) on completion tasks. Results were obtained using The Heap dataset and the SmolLM2, StarCoder 2 and Mellum models. Performance was measured by comparing the generated outputs with the ground truth, where high similarity indicates high performance. We analyse how factors such as language, model size, task type and granularity of educational value affect performance across educational value. We find that most factors show no relation with educational value, as most metrics plateau. The exception is exact match, which shows a consistent negative correlation with educational value. Additionally, a consistent turning point is seen around an educational value of 1.75, before which performance tends to have a more positive relation with educational value. Results highlight the influence of input quality on LLM behaviour and offer insights for more effective training and evaluation strategies. ...
Large Language Models (LLMs) are increasingly integrated into development workflows for tasks such as code completion, bug fixing, and refactoring. While prior work has shown that removing low-quality data—including data smells like Self-Admitted Technical Debt (SATD)—from training data can improve model performance, the isolated effect of SATD at inference time remains unclear.

This study investigates the impact of SATD on LLM performance during code completion. Using The Heap dataset, we annotate over 5 million Java files with SATD bitmasks and construct a set of input–target pairs based on varying SATD contexts and masking strategies. Three code generation models, SmolLM2, StarCoder2, and Mellum, are evaluated on both comment and method generation tasks using standard text-based metrics and manual semantic classification.

Our results show that the presence of SATD in input has a negligible effect on generation quality. Instead, performance is primarily driven by target method length, structural complexity, and context size. We also find that metrics may misrepresent semantic correctness in the presence of non-functional elements such as comments. These findings suggest that careful control of target complexity is more critical than the presence of SATD alone when evaluating LLM performance on code. ...
As Large Language Models become an ever more integral part of Software Engineering, often assisting developers on coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to have an impact on a Large Language Model’s ability on such tasks [3]. Boilerplate code is considered to be a subcategory of said smells. In this paper, we investigate a specific type of this smell, boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance. ...
Master thesis (2025) - R. Popescu, A. van Deursen, M. Izadi, J. Yang
The rapid rise in the popularity of large language models has highlighted the need for extensive datasets, especially for training on code. However, this growth has also raised important questions about the legal implications of using code in large language model training, particularly regarding the potential infringement of code licenses. At the same time, the availability of clean datasets for evaluating these models is becoming increasingly limited, due to a high risk of contamination which restricts the capacity for reliable research. On top of that, this requires researchers to repeatedly perform data curation steps in order to evaluate their models on downstream tasks, based on previously unseen data. This process is not only time- and resource-intensive but also introduces potential inconsistencies across studies, which can impact their reproducibility.
We address these challenges through a comprehensive licensing analysis and by developing robust datasets to support accurate and reproducible large language model evaluations. We compiled a list of 53 large language models trained on file-level code and analyzed their datasets, discovering pervasive license inconsistencies despite careful selection based on repository licenses. Our analysis, covering 514M code files, reveals 38M exact duplicates of strong copyleft code, and 171M file-leading comments, 16M of which are under copyleft licenses and another 11M discouraging unauthorized copying. To further understand the depth of non-permissive code in public training datasets, we developed StackLessV2, a strong copyleft Java dataset decontaminated against The Stack V2 to facilitate accurate model evaluations. Our results revealed that non-permissive code is also present at the near-duplication level, although this represents a gray area in terms of legal interpretation, where the boundary between acceptable reuse and license violation is still unclear, emphasizing the need for further legal clarification. Finally, we extend this work and introduce The Heap, a large multilingual copyleft dataset covering 57 programming languages, specifically deduplicated to avoid contamination from existing open training datasets. The Heap offers a solution for conducting fair, reproducible evaluations of large language models without the significant overhead of the data curation process. ...

Enhancing consumer-facing code completion with low-cost general enhancements

Master thesis (2024) - T.O. van Dam, M. Izadi, A. van Deursen, Egor Bogomolov, J. Yang

Building and evaluating an LLM-based code completion plugin for JetBrains IDEs

Master thesis (2024) - F.N.M. van der Heijden, A. van Deursen, M. Izadi, U.K. Gadiraju, S. Titov, A. Sergeyuk
Master thesis (2024) - P.M. de Bekker, M. Izadi, A. van Deursen, M.S. Pera
Artificial Intelligence (AI) has rapidly advanced, significantly impacting software engineering through AI-driven tools like ChatGPT and Copilot. These tools, which have garnered substantial commercial interest, rely heavily on the performance of their underlying models, assessed via benchmarks. However, the current focus on performance scores has often overshadowed the quality and rigor of these benchmarks, as emphasized by the absence of studies on this topic. This thesis addresses this gap by reviewing and improving benchmarking practices in the field of AI for software engineering (AI4SE).

First, a categorized overview and analysis of nearly a hundred prominent AI4SE benchmarks from the past decade are provided. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Lastly, a significant contribution of this work is the introduction of HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings demonstrate substantial drops in pass@1 scores for various large language models, highlighting the necessity for well-maintained and comprehensive benchmarks.

This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field. ...
Large Language Models (LLMs) are increasingly used in software development, but their potential for misuse in generating harmful code, such as malware, raises significant concerns. We present a red-teaming approach to assess the safety and ethical alignment of LLMs in the context of code generation, in particular how it applies to the generation of malware. By developing a dataset of prompts that are likely to elicit harmful behavior from the LLMs, we aim to provide a valuable resource for benchmarking the harmlessness factor of these models. Using this dataset, we evaluate multiple state-of-the-art open-source LLMs, analyzing factors such as model size, training alignment, and prompt specificity. Our findings show that LLMs vary significantly in their likelihood to generate harmful code, depending on factors like training data, alignment techniques, and prompt specificity. Furthermore, we demonstrate that system prompts could significantly alter the model's response to potentially harmful queries. We also demonstrate the efficacy of using LLMs to evaluate the harmlessness of other LLMs' responses. This research highlights the importance of ongoing development of safety measures to mitigate the risks associated with code-generating LLMs. ...

Exploring Dangerous and Unfair Software Applications

The rapid advancement of large language models has enabled numerous innovative, but also harmful applications. It is therefore essential to build these models so that they behave safely and responsibly. One way to improve these models is by red teaming them. In this study, we aim to identify prompts that lead large language models to exhibit unfair or dangerous behavior in software and cybersecurity contexts. We do this by manually creating prompts and manually assessing the harmfulness of the response. Our contributions include a taxonomy of dangerous and unfair use cases of large language models for code, a dataset of 200 prompts tested on eight models, and an investigation into how expanding the prompt and adding a code skeleton for the model to complete change the level of harmfulness. Among the eight models evaluated, only CodeGemma and GPT-3.5-0125 were well-aligned against our taxonomy categories. The unaligned Dolphin-Mixtral and self-aligned Starcoder 2 were notably susceptible to harmful responses across all categories. We observed that the Model Attacks category was problematic for most models. Expanding prompts increased harmful responses in the Cyber Attacks, Model Attacks, and Phishing categories but decreased them in the Biased Code Generation category. Adding a code skeleton to prompts consistently raised harmfulness across all categories. Large language model alignment still needs further improvement, so we suggest employing red teaming techniques to enhance the safety features of large language models. ...

An Exploratory Study Through Red Teaming

Bachelor thesis (2024) - B. Koc, A. Al-Kaswan, M. Izadi, A. van Deursen, K. Liang
Large Language Models (LLMs) have experienced a rapid increase in usage across numerous sectors in recent years. However, this growth brings a greater risk of misuse. This paper explores the issue of copyright infringement facilitated by LLMs in the domain of software engineering. Through the creation of a taxonomy and prompt engineering, we investigate how alignment, structure and language of prompts affect the behavior of LLMs against copyright infringing prompts, assessing their willingness to engage in copyright violation. Our findings underscore the critical role of model alignment in identifying potentially infringing inputs, irrespective of model complexity or modality. Notably, prompts that are crafted to avoid overtly malicious language, especially those that instruct the model to complete the input given, tend to yield more responses that could facilitate malicious activities. This research provides a preliminary understanding of copyright infringement by LLMs in software engineering and suggests avenues for future research. ...

Does choice of activation function matter in smaller Language Models?

The rapid expansion of large language models (LLMs) driven by the transformer architecture has raised concerns about the lack of high-quality training data. This study investigates the role of activation functions in smaller-scale language models, specifically those with approximately 10M parameters, to ensure sustained progress in LLM development despite data limitations. Activation functions, crucial for neural network performance, have evolved significantly, but comprehensive comparisons under consistent conditions remain scarce, especially for smaller parameter count models. This research systematically evaluates traditional and novel activation functions, including learnable variants, and introduces the Kolmogorov-Arnold Network (KAN) to language modeling. Using Hugging Face implementations of GPT-Neo and RoBERTa models, performance impacts were assessed through the BabyLM evaluation pipeline. The results indicate that activation functions do not significantly impact the performance of these models. Additionally, the model with the KAN network underperformed compared to models with traditional architectures in the context of this study. These findings suggest that optimizing activation functions may not be crucial for smaller language models, emphasizing the need for further research to explore other architectural improvements. ...