J.B. Katzy
Please Note
This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.
17 records found
Refactoring is a critical part of the software development lifecycle, and identifier renaming accounts for roughly 15% of all agentic refactoring work driven by large language models. Yet the dominant model families fit the task poorly. Autoregressive decoders generate left to right, and even with the fill-in-the-middle extension they resolve masked positions one at a time, so a renaming decision at one site cannot inform a decision at another. Identifier renaming, however, demands consistency across every affected site at once. Diffusion Large Language Models (dLLMs) generate by iteratively denoising a masked sequence under full bidirectional attention, with every prediction conditioned on every other. This matches what renaming needs: if a poorly named identifier is viewed as a small amount of semantic noise overlaid on correct code, then renaming becomes a targeted denoising task that can be solved jointly across all affected sites.
We instantiate this view as CoReFusion, the first systematic study of dLLMs on Java identifier renaming, and benchmark them against twelve decoder-only FIM-AR baselines and five encoder-decoder Seq2Seq baselines on the RefineID dataset. DreamCoder-7B and DiffuCoder-7B reach 33.2% and 31.1% Exact Match, beating the best non-dLLM model (CodeT5-large) by more than ten points while being roughly nine times smaller than the largest FIM-AR baseline. The advantage grows with the number of identifiers that must be renamed together: FIM-AR models win the single-site case, but dLLMs pull ahead as soon as the task involves more than one site. When the same dLLMs must instead find the positions on their own, Exact Match drops to about 3%, and most wrong predictions copy the lexical style of the surrounding code rather than improve on it. Probing the internal states of DiffuCoder-7B shows why: the signal that tells a bad name from a good one appears only in the last few layers and the last few denoising steps, after the unmasking schedule has already confirmed most of its predictions. Providing the rename positions as masks bypasses this timing problem, which is why dLLMs work as filling engines but not as standalone refactoring agents.
The Illusion of Ability: The Poisoned Promise of LLM Performance
An Evaluation of the Min-K% Prob membership inference attack
Large Language Models are becoming increasingly popular in software engineering, yet the exact composition of their training data remains largely undisclosed. This opacity introduces risks regarding copyright infringement and benchmark contamination. In this work, we audit the susceptibility of different models (StarCoder2, Mellum, and SmolLM3) to Membership Inference Attacks on code files, specifically evaluating the Min-K% Prob method.
We find that this approach serves as an effective auditor, achieving ROC-AUC scores of up to 0.793, yet performance degrades as non-members become more similar to members. The classification is primarily driven by non-functional artifacts, such as license headers and package identifiers.
Furthermore, we investigate post-training quantization as an attack accelerator. We find that the membership signal remains robust even when weights are compressed from 32-bit to 4-bit precision, and the use of 16-bit Brain Float (BF16) format reduces inference latency by a factor of 6, establishing MKP as a practical tool for assessing membership in models' training sets.
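As a rough illustration of the Min-K% Prob score evaluated in this abstract, the sketch below averages the log-probabilities of the least likely fraction of tokens in a file. It is a minimal sketch and not the authors' implementation: the function name `min_k_prob`, the default `k=0.2`, and the per-token log-probabilities are all hypothetical, and in practice the log-probabilities would come from the audited model.

```python
import math


def min_k_prob(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens.

    A higher (less negative) score suggests the text is more likely
    to have been seen during training.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Always keep at least one token, even for very short inputs.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for two files.
seen = [-0.1, -0.2, -0.05, -0.3, -0.15, -0.1, -0.25, -0.2, -0.1, -0.05]
unseen = [-0.1, -2.5, -0.05, -3.0, -0.15, -0.1, -1.8, -0.2, -0.1, -0.05]

print(min_k_prob(seen))
print(min_k_prob(unseen))
```

A file is then classified as a member when its score exceeds a threshold chosen on held-out data; the threshold choice is outside the scope of this sketch.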
Code language models are pretrained on massive datasets scraped from public repositories, and the contents of these datasets are rarely disclosed. Membership Inference Attacks (MIAs) aim to predict whether specific samples were used in training, but attack performance is contested. Previous work has shown that many attacks on LLMs perform no better than random when evaluated on independent and identically distributed (i.i.d.) members and non-members. We consider three MIAs: LOSS, MinK%, and SURP (where each attack extends the last with additional filtering of tokens considered for the membership signal), on StarCoder2-3B and Mellum-4B using the AISE MIA dataset, which contains 100,000 Java files with verified membership labels. We address a gap in the evaluation of these attacks on i.i.d. code samples and in the detailed comparison of SURP and MinK%. A bag-of-words (BoW) classifier is used to measure distribution shift, with an expected ROC-AUC of 0.5 under i.i.d. conditions. The BoW classifier achieves a ROC-AUC of 0.91, confirming substantial distribution shift. We apply two debiasing procedures to construct evaluation subsets: taking samples close to the BoW decision boundary reduces BoW ROC-AUC to 0.66, while selecting BoW-misclassified samples fails to reduce shift. After debiasing, all attacks perform at or below the bag-of-words baseline, with ROC-AUC between 0.55 and 0.63 and TPR at 5% FPR between 0.05 and 0.16, suggesting random performance under strict i.i.d. conditions. Hyperparameter ablation reveals that SURP collapses to MinK% under optimization: optimal configurations disable SURP filtering or have classification agreement exceeding 94%, excluding one outlier. These results extend prior natural language findings to code: reference-free attacks exploit distributional differences rather than detecting membership.
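The two metrics reported in this abstract, ROC-AUC and TPR at a fixed FPR, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the evaluation code behind the reported numbers: the function names and the attack scores are hypothetical, and higher scores are assumed to indicate membership.

```python
def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half a win, so 0.5 corresponds to random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Best true-positive rate over thresholds with FPR <= max_fpr."""
    best = 0.0
    thresholds = sorted(set(member_scores) | set(nonmember_scores))
    for t in thresholds:
        # A sample is predicted "member" when its score is >= t.
        fpr = sum(s >= t for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= t for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Hypothetical attack scores.
members = [0.9, 0.8, 0.4, 0.3]
nonmembers = [0.7, 0.5, 0.2, 0.1]

print(roc_auc(members, nonmembers))
print(tpr_at_fpr(members, nonmembers, max_fpr=0.05))
```

The pairwise loop is quadratic in the number of samples, which is fine for a toy example; for a dataset of 100,000 files a rank-based computation or a library routine would be the practical choice.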
Large Language Models (LLMs) for code are trained on large amounts of data that may contain copyrighted and licensed content, which motivates internal auditing methods that can test whether specific data points were included during training. In this work, we conduct an exploratory evaluation of membership inference attacks (MIAs) as auditing signals for code-specialized LLMs. We compare a loss-based baseline to Polarized Augment Calibration (PAC) across three open models in the 3B to 4B range (Mellum-4B, StarCoder2-3B, and SmolLM3-3B) using the Java subset of a contamination-controlled evaluation dataset. We find that PAC provides consistent improvements over the loss signal on the code models, while near-member samples are detected almost as effectively as exact members. A stratified analysis shows that attack performance varies substantially with file properties, with the strongest separability on small-to-medium files and on code with higher alphanumeric content, and degradation on very large files. Motivated by the syntactic fragility of token-swap augmentation on code, we propose PAC-AST, an AST-guided augmentation scheme that generates syntactically valid neighbors. PAC-AST exhibits improved behavior on larger and syntactically complex files, where token-swap PAC degrades, but underperforms in smaller and alphanumeric-rich strata, due in part to a reduced effective mutation magnitude. Overall, the results indicate that (i) calibration-based signals can strengthen grey-box auditing for code models, (ii) dataset and program characteristics are major drivers of measured leakage, and (iii) code-specific augmentation is a promising direction but requires controlling perturbation magnitude and neighbor quality to yield stable gains.
https://zenodo.org/records/18367988
https://doi.org/10.5281/zenodo.18367987
Bachelor thesis (2025) - B.R.M. Annink, A. van Deursen, M. Izadi, J.B. Katzy, R.M. Popescu, A. Anand
This paper investigates the relation between the educational value of input code and the subsequent inference performance of code large language models (LLMs) on completion tasks. Results were obtained using The Heap dataset and the SmolLM2, StarCoder 2, and Mellum models. Performance was measured by comparing the generated outputs with the ground truth, where high similarity indicates high performance. We analyse how factors such as language, model size, task type, and granularity of educational value affect performance across educational value. We find that most factors show no relation with educational value, as most metrics plateau. The exception is exact match, which shows a consistent negative correlation with educational value. Additionally, a consistent turning point is seen around an educational value of 1.75, before which performance tends to have a more positive relation with educational value. Results highlight the influence of input quality on LLM behaviour and offer insights for more effective training and evaluation strategies.
Large Language Models (LLMs) are increasingly integrated into development workflows for tasks such as code completion, bug fixing, and refactoring. While prior work has shown that removing low-quality data—including data smells like Self-Admitted Technical Debt (SATD)—from training data can improve model performance, the isolated effect of SATD at inference time remains unclear.
This study investigates the impact of SATD on LLM performance during code completion. Using The Heap dataset, we annotate over 5 million Java files with SATD bitmasks and construct a set of input–target pairs based on varying SATD contexts and masking strategies. Three code generation models, SmolLM2, StarCoder2, and Mellum, are evaluated on both comment and method generation tasks using standard text-based metrics and manual semantic classification.
Our results show that the presence of SATD in input has a negligible effect on generation quality. Instead, performance is primarily driven by target method length, structural complexity, and context size. We also find that metrics may misrepresent semantic correctness in the presence of non-functional elements such as comments. These findings suggest that careful control of target complexity is more critical than the presence of SATD alone when evaluating LLM performance on code.
As Large Language Models become an ever more integral part of Software Engineering, often assisting developers on coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to have an impact on a Large Language Model’s ability on such tasks [3]. Boilerplate code is considered to be a subcategory of said smells. In this paper, we investigate a specific type of this smell, boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance.
Large Language Models (LLMs) are increasingly used for code-centric tasks. However, their training data often exhibits data smells that may hinder downstream quality. This research focuses on the “Uneven Natural Languages” smell and the presence of non-English text in source code and investigates its effect on LLM-based code generation and summarisation. We construct a three-stage (Detection, Generation, Evaluation) pipeline that annotates every character in a file with its predicted language using Tree-sitter, FastText, and pycld2; masks target spans via causal masking and Fill-in-the-Middle (FIM); and prompts three chosen models (SmolLM2, StarCoder 2, and Mellum-4B). The pipeline uses The Heap dataset, but this research focuses only on its Java subset.
In 3.35 million Java files, we find that English tokens account for more than 90% of comments, strings, and identifiers, while Chinese, Spanish, Portuguese, and French form a long-tailed minority. Despite this skew, LLMs achieve marginally higher BLEU, METEOR, ROUGE, and Exact Match scores when non-English elements are present or masked. Mellum consistently yields the most fluent continuations; StarCoder 2 retains broader token recall; SmolLM2 lags on both axes, reflecting its smaller capacity.
Our publicly available code enables reproducible assessment of multilingual data smells and lays the groundwork for cleaner, language-aware pre-training corpora and more robust multilingual code assistants.
This paper evaluates the performance of Large Language Models, specifically StarCoder 2, in non-English code summarization, with a focus on the Greek language. We establish a hierarchical error taxonomy through an open coding approach to better understand and improve Large Language Models in multilingual settings, and we identify challenges associated with tokenization and with the influence of mathematical datasets. Our study includes a comprehensive analysis of error types, tokenization efficiency, and quantitative metrics such as BLEU, ROUGE, and Semantic Similarity. The findings highlight the importance of semantic similarity as a reliable performance metric and suggest the need for more inclusive tokenizers and training datasets to address the limitations and errors in non-English contexts.
After the emergence of BERT, Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities and have seen widespread adoption globally, particularly in the field of programming. However, current evaluations and benchmarks of LLMs on code primarily focus on English use cases. In this study, we assess the performance of LLMs in generating Chinese Java code comments through open coding. Our experiments highlight the prevalence of model-specific and semantic errors in generating Chinese code comments using LLMs, while also revealing a relative absence of grammatical issues due to the unique characteristics of the Chinese language. Additionally, we validated the potential for quantitatively analyzing semantic errors, especially Hallucinations, by examining the cosine similarity of word embeddings. Our findings propose an Error Taxonomy for evaluating LLMs on code in non-English scenarios and demonstrate the possibilities of using cosine similarity of word embeddings to judge the quality of code comment generation.
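The cosine-similarity check mentioned in this abstract can be illustrated with a minimal sketch. It assumes each comment is represented by the average of its word vectors. The tiny `toy_embeddings` table and the helper names are hypothetical and are not taken from the study, which does not state its embedding model here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_u * norm_v)


def mean_embedding(tokens, embeddings):
    """Average the vectors of all tokens that have an entry in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Hypothetical 2-dimensional word vectors, for illustration only.
toy_embeddings = {
    "返回": [1.0, 0.0],
    "用户": [0.0, 1.0],
    "名称": [0.2, 0.8],
    "删除": [-1.0, 0.0],
}

reference = mean_embedding(["返回", "用户", "名称"], toy_embeddings)
generated = mean_embedding(["删除", "用户", "名称"], toy_embeddings)
print(cosine_similarity(reference, generated))
```

A low score between the reference and generated comment vectors would flag the pair for closer inspection; any threshold for calling something a hallucination would have to be calibrated on labeled data.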
This research evaluates the performance of Meta's Code Llama 7B model in generating comments for Java code written in Polish. Using a mixed-methods approach, we combine quantitative and qualitative analyses to assess the model's accuracy and limitations. We preprocess a dataset of Polish Java code from GitHub, apply a Fill-in-the-Middle objective for code comment completion, and evaluate the results using BLEU and ROUGE-L metrics. Additionally, we manually evaluate approximately 1150 generated comments and document the encountered errors. Based on the findings, we iteratively develop a taxonomy of errors using an open coding approach.
Through an expert evaluation, we uncover the limitations of the BLEU metric in assessing comment quality for non-English languages, as its scores differ substantially from human evaluation. Our research identifies the most frequent errors in code comment completion in Polish: generation of code snippets, copying of context, late termination, hallucinations, and repetitions. Only 25.2% of the generated comments were classified as correct. This study is part of a broader research effort covering multiple models across various non-English languages. We aim to raise awareness of the accessibility of large language models for code in non-English environments, thereby improving their inclusivity.
LLM of Babel
An analysis of the behavior of large language models when performing Java code summarization in Dutch
How well do large language models (LLMs) infer text in a non-English context when performing code summarization? The goal of this paper was to understand the mistakes made by LLMs when performing code summarization in Dutch. We categorized the mistakes made by CodeQwen1.5-7b when inferring Java code comments in the Dutch language through an open coding methodology to create a taxonomy of errors by which to categorize these mistakes.
Dutch code comments scraped from GitHub were analyzed, resulting in a taxonomy that revealed four broad categories under which inference errors could be classified: Semantic, Syntactic, Linguistic, and LLM Specific. Additional analysis revealed a prevalence of semantic and LLM specific errors in the dataset compared to the other categories. The resulting taxonomy has significant overlap with other taxonomies in similar fields like machine translation and English code summarization while introducing several categories that are not prevalent in those fields. Furthermore, it was found that BLEU-1 and ROUGE-L metrics were unreliable as accuracy measures in this use case due to their nature as similarity metrics.
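To illustrate why a similarity metric can mislead as an accuracy measure, here is a minimal sketch of ROUGE-L computed as an F1 score over the longest common subsequence of whitespace tokens. The F1 weighting, the whitespace tokenization, and the two Dutch comments are assumptions made for illustration; they are not the study's exact setup or data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up example: one changed word flips the meaning ("name" vs. "age").
reference = "geeft de naam van de gebruiker terug"
generated = "geeft de leeftijd van de gebruiker terug"
print(round(rouge_l_f1(generated, reference), 3))  # 0.857
```

The generated comment describes the wrong field, yet six of its seven tokens match the reference in order, so the score stays high. This is the kind of gap between surface overlap and correctness that the abstract points to.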
Interest in Large Language Models is growing, especially in software development tasks such as code completion and comment generation. However, most Large Language Models are primarily trained on English language data, raising concerns about their effectiveness when applied to other languages. This research investigates the performance of CodeGemma-7B, a transformer-based model, in generating code comments in Dutch, addressing the multilingual model training and evaluation gap. Using a dataset of Java source code containing Dutch comments, we aim to assess the model's ability for non-English use cases by evaluating the comments it generates.
Our process involved several stages, starting with collecting a dataset of Java files from GitHub that included common Dutch words. We filtered and masked the dataset and inferred new comments. Additionally, we trained a custom tokenizer to investigate the potential inefficiencies of the Gemma tokenizer when applied to Dutch code. For the qualitative analysis, we employed an open coding approach to identify common errors and patterns in the generated comments. Quantitative analysis was performed using BLEU-4 and ROUGE-L scores to compare the generated comments against the original ones, considering comment and context lengths.
Qualitative analysis revealed common errors, such as syntactically correct but factually faulty statements, unintended code snippets, and linguistic errors. These findings highlight areas for improvement in factual accuracy and model biases. Quantitative results showed high similarity scores, with 26% of the comments getting a BLEU-4 score above 0.95, and 28% getting a ROUGE-L score above 0.95. Additionally, the custom tokenizer we trained showed better efficiency than the Gemma tokenizer, with our tokenizer having a 5.35% better compression factor.
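A tokenizer comparison of this kind can be sketched as follows. The abstract does not define "compression factor", so this sketch assumes it means characters per token over a corpus, with a higher value meaning fewer tokens for the same text. The two toy tokenizers are hypothetical stand-ins and are not the Gemma tokenizer or the custom one.

```python
def compression_factor(texts, tokenize):
    """Characters per token over a corpus (assumed definition)."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def relative_improvement(baseline, candidate):
    """Percentage by which `candidate` exceeds `baseline`."""
    return (candidate - baseline) / baseline * 100.0


def chunk_tokenizer(text, size=3):
    """Toy tokenizer: fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def whitespace_tokenizer(text):
    """Toy tokenizer: split on whitespace."""
    return text.split()


corpus = ["// bereken de som"]  # made-up Dutch comment, 17 characters
base = compression_factor(corpus, chunk_tokenizer)
custom = compression_factor(corpus, whitespace_tokenizer)
print(base, custom, relative_improvement(base, custom))
```

With real tokenizers, `tokenize` would be replaced by a call that returns the token list for a string, and the corpus would be the held-out Dutch Java files.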
We present an investigation into the relationship between the average depth of the first correct prediction and the performance of CodeGen. This was done on a dataset of code files written in C++, Go, Java, Julia, Kotlin, and Python. The analysis involved investigating the model's predictions at different layers using a Tuned Lens, which enables examining the intermediate representations. Additionally, attention heads were examined to gain insights into the model's behavior. We found that there is a subset of four layers in which tokens are predicted correctly for the first time. These peaks are evident in CodeGen's performance and come after a small dip, a dip that is present in the last layer. The results shed light on the varying performance of different layers and provide valuable insights into the strengths and weaknesses of CodeGen. These findings contribute to our greater understanding of language model performance in code completion tasks and provide implications for future improvements in this domain.
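The "depth of the first correct prediction" can be made concrete with a small sketch. It assumes that a lens has already decoded one top-1 token per layer for each position, and that "first correct" means the earliest layer whose token matches the target, whether or not later layers agree. The function names and the hard-coded layer outputs are hypothetical.

```python
def first_correct_layer(layer_predictions, target):
    """1-based index of the earliest layer predicting `target`, or None."""
    for depth, prediction in enumerate(layer_predictions, start=1):
        if prediction == target:
            return depth
    return None


def average_first_correct_depth(samples):
    """Mean first-correct depth over samples that are ever predicted correctly."""
    depths = []
    for layer_predictions, target in samples:
        depth = first_correct_layer(layer_predictions, target)
        if depth is not None:
            depths.append(depth)
    return sum(depths) / len(depths) if depths else None


# Made-up per-layer top-1 tokens for three positions in a 4-layer model.
samples = [
    (["x", "y", "return", "return"], "return"),  # first correct at layer 3
    (["a", "b", "c", "d"], "z"),                 # never correct, skipped
    (["if", "if", "if", "if"], "if"),            # first correct at layer 1
]
print(average_first_correct_depth(samples))  # 2.0
```

Counting how often each layer is the first correct one, rather than averaging, would give the per-layer histogram in which peaks such as the four-layer subset described above could show up.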
The development of contemporary source code auto-completion tools has significantly boosted developer productivity and efficiency. In 2021, the GPT-2-based Transformer CodeGPT was developed to support code completion and text-to-code generation. Like most code models, however, CodeGPT was trained on a limited set of widely-used languages (Java, Python), leading to constrained efficacy in lower-resource languages. This motivated us to research CodeGPT's performance on the token-level code completion task across high- and low-resource languages. We investigate in which scenarios CodeGPT predicts incorrect tokens with high certainty using a tuned lens, followed by studying attention patterns that underlie the observed behaviour. Our findings indicate that CodeGPT is most competent in Java and Python code (Top-1 accuracies: 69.2% and 68.2% respectively). It generates false predictions with highest confidence when it encounters unfamiliar constructs in low-resource languages, or code structures that cannot be predicted from left context only. Moreover, we find a positive correlation between null attention and model confidence.
Large Language Models of code have seen significant jumps in performance recently. However, these jumps tend to accompany a notable and perhaps concerning increase in scale and costs. We contribute an evaluation of prediction performance with respect to model size by assessing the layer-wise progression for language and user-defined elements in code, using a new technique of Tuned Lenses. We show that language-defined elements can be predicted more accurately in earlier layers of the PolyCoder model than user-defined elements and contribute an evaluation of the attention mechanism, which shows patterns that explain such aspects of performance and indicate areas of missed potential. These findings encourage research into the internal prediction performance for other characteristic aspects of code and could lead to the introduction of new methods that make use of these characteristics to improve performance without relying on scaling.
In recent years, deep learning techniques, particularly transformer models, have demonstrated remarkable advancements in the accuracy and efficiency of language models. These models provide the foundation for many natural language processing tasks, including code completion. The effectiveness of code completion models has been the subject of a variety of empirical studies. However, none of the existing literature has explicitly investigated the potential impact of common code structures on the performance of large language models during code completion. This paper evaluates the influence of common code structures on the code completion performance of CodeParrot, a state-of-the-art natural language processing model. Using the tuned lens method, we show that typical code structures lead to a higher completion accuracy compared to uncommon code structures, due to their frequent occurrence, consistent syntax, clear semantics, and contextual clues. Finally, we perform an attention investigation to assess the significance of the common code structures and reveal potential data patterns across low- and high-resource languages.