M. Izadi
45 records found
We instantiate this view as CoReFusion, the first systematic study of dLLMs on Java identifier renaming, and benchmark them against twelve decoder-only FIM-AR baselines and five encoder-decoder Seq2Seq baselines on the RefineID dataset. DreamCoder-7B and DiffuCoder-7B reach 33.2% and 31.1% Exact Match, beating the best non-dLLM model (CodeT5-large) by more than ten points while being roughly nine times smaller than the largest FIM-AR baseline. The advantage grows with the number of identifiers that must be renamed together: FIM-AR models win the single-site case, but dLLMs pull ahead as soon as the task involves more than one site. When the same dLLMs must instead find the positions on their own, Exact Match drops to about 3%, and most wrong predictions copy the lexical style of the surrounding code rather than improve on it. Probing the internal states of DiffuCoder-7B shows why: the signal that tells a bad name from a good one appears only in the last few layers and the last few denoising steps, after the unmasking schedule has already confirmed most of its predictions. Providing the rename positions as masks bypasses this timing problem, which is why dLLMs work as filling engines but not as standalone refactoring agents.
We evaluate CoCA at ASML using CoCABench, an internal benchmark focused on long-horizon tasks. It comprises 5 epics from 2 proprietary Java repositories, with 44 developer-identified subtasks ranging from a 2-day bug fix to 3 months of feature work. Full CoCA is associated with higher ground-truth alignment than the single-agent baseline (0.44 versus 0.25) on the LLM-judge metric with the strongest inter-rater reliability (Krippendorff's α=0.46). However, it achieves only 0.20 pass@1 despite 0.60 build@1, while the single-agent baseline achieves the highest pass@1.
These research findings suggest that IDE-constrained agentic workflows can move generated implementations closer to the intended developer solution, but do not yet solve reliable executable integration. CoCA is therefore best understood as a developer-in-the-loop assistance workflow rather than a fully autonomous implementation system or a replacement for direct Copilot prompting. It appears most appropriate for long, integration-heavy feature epics where planning, context continuity, and repository awareness are valuable. For small localized fixes, the orchestration overhead may outweigh these gains.
Evaluating Autonomous Coding Agents for Code Refactoring and Maintainability
A Large-Scale Study of Open-Source Software
Our results show that agent-authored pull requests refactor less frequently and less diversely than human-authored pull requests, but their refactorings tend to affect larger code regions, especially in less popular repositories. Maintainability outcomes are mixed: agent-modified code is more likely to contain code smells after merge, while median metric changes remain context-dependent and broadly comparable to human-authored code. Longitudinally, agent-modified code shows similar maintainability trends after the early post-merge period, although agent-modified regions are revisited more frequently.
https://zenodo.org/records/18367988
https://doi.org/10.5281/zenodo.18367987
The Illusion of Ability: The Poisoned Promise of LLM Performance
An Evaluation of the Min-K% Prob membership inference attack
We find that this approach serves as an effective auditor, achieving ROC-AUC scores of up to 0.793, yet performance degrades as non-members become more similar to members. The classification is primarily driven by non-functional artifacts, such as license headers and package identifiers.
Furthermore, we investigate post-training quantization as an attack accelerator. We find that the membership signal remains robust even when weights are compressed from 32-bit to 4-bit precision, and the use of 16-bit Brain Float (BF16) format reduces inference latency by a factor of 6, establishing MKP as a practical tool for assessing membership in models' training sets.
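The abstract does not spell out how the MKP score is computed. As the attack is commonly described, Min-K% Prob averages the log-probabilities of the k% of tokens that the model finds least likely, and a higher average is read as evidence that the text was seen during training. The following is a minimal sketch under that assumption. The function name `min_k_prob`, the default `k`, and the assumption that per-token log-probabilities have already been obtained from the model are illustrative and not taken from the thesis.

```python
import math


def min_k_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    token_logprobs: per-token log-probabilities of one text under the
    audited model (assumed to be computed elsewhere).
    k: fraction of tokens to keep, in (0, 1]; the default is illustrative.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so short inputs still produce a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical inputs: a text with a few surprising tokens scores lower.
familiar = [-0.1, -0.2, -0.1, -0.3]
surprising = [-0.1, -6.0, -0.2, -7.5]
print(min_k_prob(familiar, k=0.5), min_k_prob(surprising, k=0.5))
```

A membership decision would then compare this score to a threshold, and sweeping that threshold over labeled members and non-members is what yields a ROC-AUC figure like the one reported above.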
Our experiments demonstrate that parameter-efficient fine-tuning, particularly LoRA with carefully selected adapter ranks, achieves strong performance across reasoning and non-reasoning regimes while maintaining low computational cost. Explicit reasoning supervision is not required for high repair accuracy, but it significantly reduces reasoning trace lengths and inference costs. Dataset diversity and multi-turn trajectories are key to improving generalization and bridging the gap between reasoning and non-reasoning inference. Finally, this study seeks to provide empirical insights into the practical adaptation of SLMs for repository-specific APR, evaluating how strategic choices in dataset design, lightweight fine-tuning approaches, and reasoning supervision influence performance in real-world contexts.
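To give a rough sense of why the adapter rank drives LoRA's cost: LoRA keeps a d×k weight matrix frozen and trains two low-rank factors of shapes d×r and r×k, so each adapted matrix adds r·(d + k) trainable parameters. The sketch below only does that arithmetic. The layer dimensions and ranks are hypothetical and are not the configurations used in the study.

```python
def lora_trainable_params(d, k, r):
    """Trainable parameters LoRA adds to one d x k weight matrix at rank r."""
    return r * (d + k)


def full_params(d, k):
    """Parameters updated when the same matrix is fully fine-tuned."""
    return d * k


# Hypothetical 4096 x 4096 projection matrix, swept over a few ranks.
d = k = 4096
for r in (4, 8, 16, 64):
    added = lora_trainable_params(d, k, r)
    share = added / full_params(d, k)
    print(f"rank={r:>2}  trainable={added:>7}  share of full matrix={share:.4%}")
```

The count grows linearly with the rank, which is one reason rank selection is a trade-off between capacity and cost.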
Gen-AI Meets Domain Expertise: LLMs for Domain Specific Code Generation
A study conducted at the ASML leveling department
This study investigates the impact of SATD on LLM performance during code completion. Using The Heap dataset, we annotate over 5 million Java files with SATD bitmasks and construct a set of input–target pairs based on varying SATD contexts and masking strategies. Three code generation models, SmolLM2, StarCoder2, and Mellum, are evaluated on both comment and method generation tasks using standard text-based metrics and manual semantic classification.
Our results show that the presence of SATD in input has a negligible effect on generation quality. Instead, performance is primarily driven by target method length, structural complexity, and context size. We also find that metrics may misrepresent semantic correctness in the presence of non-functional elements such as comments. These findings suggest that careful control of target complexity is more critical than the presence of SATD alone when evaluating LLM performance on code.
We address these challenges through a comprehensive licensing analysis and by developing robust datasets to support accurate and reproducible large language model evaluations. We compiled a list of 53 large language models trained on file-level code and analyzed their datasets, discovering pervasive license inconsistencies despite careful selection based on repository licenses. Our analysis, covering 514M code files, reveals 38M exact duplicates of strong copyleft code and 171M file-leading comments, 16M of which are under copyleft licenses and another 11M of which discourage unauthorized copying. To further understand the depth of non-permissive code in public training datasets, we developed StackLessV2, a strong copyleft Java dataset decontaminated against The Stack V2 to facilitate accurate model evaluations. Our results reveal that non-permissive code is also present at the near-duplication level, although this is a legal gray area in which the boundary between acceptable reuse and license violation remains unclear, underscoring the need for further legal clarification. Finally, we build on this work and introduce The Heap, a large multilingual copyleft dataset covering 57 programming languages, specifically deduplicated to avoid contamination from existing open training datasets. The Heap offers a solution for conducting fair, reproducible evaluations of large language models without the significant overhead of the data curation process.
Black-box context-aware code completion
Enhancing consumer-facing code completion with low-cost general enhancements
Interactive & Adaptive LLMs
Building and evaluating an LLM-based code completion plugin for JetBrains IDEs
First, a categorized overview and analysis of nearly a hundred prominent AI4SE benchmarks from the past decade are provided. Based on this analysis, several challenges and future directions are identified and discussed, including quality control, programming and natural language diversity, task diversity, purpose alignment, and evaluation metrics. Lastly, a significant contribution of this work is the introduction of HumanEvalPro, an enhanced version of the original HumanEval benchmark. HumanEvalPro incorporates more rigorous test cases and edge cases, providing a more accurate and challenging assessment of model performance. The findings demonstrate substantial drops in pass@1 scores for various large language models, highlighting the necessity for well-maintained and comprehensive benchmarks.
This thesis aims to set a new standard for AI4SE benchmarks, providing a foundation for future research and development in this rapidly evolving field.
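For readers unfamiliar with the pass@1 scores mentioned above: pass@k is usually computed with the unbiased estimator introduced alongside the original HumanEval benchmark, which, for a problem with n sampled solutions of which c pass all tests, estimates the probability that at least one of k samples passes as 1 − C(n−c, k) / C(n, k). The sketch below assumes that estimator. Whether the thesis uses exactly this formulation is not stated in the abstract, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: number of generated samples, c: samples passing all tests,
    k: number of attempts the metric allows (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, any k-subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: stricter tests reduce passing samples from 6 to 3 of 10.
print(pass_at_k(10, 6, 1), pass_at_k(10, 3, 1))
```

For k = 1 the estimate reduces to c / n, so a stricter test suite that rejects previously accepted samples lowers pass@1 directly.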
Red Teaming Large Language Models for Code
Exploring Dangerous and Unfair Software Applications
Implications of LLMs4Code on Copyright Infringement
An Exploratory Study Through Red Teaming
Evaluating Adaptive Activation Functions in Language Models
Does the choice of activation function matter in smaller Language Models?