Large Language Models (LLMs) are increasingly used for code-centric tasks. However, their training data often exhibits data smells that may hinder downstream quality. This research focuses on the “Uneven Natural Languages” smell, i.e., the presence of non-English text in source code, and investigates its effect on LLM-based code generation and summarisation. We construct a three-stage pipeline (Detection, Generation, Evaluation) that annotates every character in a file with its predicted language using Tree-sitter, FastText, and pycld2; masks target spans via causal masking and Fill-in-the-Middle (FIM); and prompts three selected models (SmolLM2, StarCoder 2, and Mellum-4B). The pipeline is applied to The Heap dataset, restricted here to its Java subset.
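To make the Detection stage concrete, the sketch below combines Tree-sitter, fastText, and pycld2 to label the natural-language spans of a Java file. It assumes recent Python bindings for tree-sitter and tree-sitter-java and a locally available fastText lid.176.bin language-identification model; the function and variable names are illustrative and do not correspond to the released implementation.

\begin{verbatim}
# Minimal sketch of the Detection stage (illustrative, not the released code).
import fasttext
import pycld2 as cld2
import tree_sitter_java as tsjava
from tree_sitter import Language, Parser

JAVA = Language(tsjava.language())
parser = Parser(JAVA)
lid = fasttext.load_model("lid.176.bin")  # fastText language-ID model

# Java node types that carry natural-language text.
NATURAL_TEXT_NODES = {"line_comment", "block_comment", "string_literal"}

def natural_language_spans(source: bytes):
    """Yield (start_byte, end_byte, text) for comments and string literals."""
    tree = parser.parse(source)
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in NATURAL_TEXT_NODES:
            text = source[node.start_byte:node.end_byte].decode("utf-8", "replace")
            yield node.start_byte, node.end_byte, text
        stack.extend(node.children)

def predict_language(text: str):
    """Combine fastText and pycld2 predictions for one span."""
    labels, probs = lid.predict(text.replace("\n", " "))  # fastText rejects newlines
    ft_lang = labels[0].replace("__label__", "")
    _, _, details = cld2.detect(text)
    cld_lang = details[0][1]  # ISO code of pycld2's top guess
    return ft_lang, float(probs[0]), cld_lang

if __name__ == "__main__":
    # "Calcula la suma" / "hola mundo" are Spanish ("computes the sum" / "hello world").
    code = b'class Demo { // Calcula la suma\n  String s = "hola mundo"; }'
    for start, end, span in natural_language_spans(code):
        print(start, end, predict_language(span))
\end{verbatim}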
In 3.35 million Java files, we find that English tokens account for more than 90\% of comments, strings, and identifiers, while Chinese, Spanish, Portuguese, and French form a long-tailed minority. Despite this skew, LLMs achieve marginally higher BLEU, METEOR, ROUGE, and Exact Match scores when non-English elements are present or masked. Mellum consistently yields the most fluent continuations; StarCoder 2 retains broader token recall; SmolLM2 lags on both axes, reflecting its smaller capacity.
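The following is a minimal sketch of how per-prediction BLEU, METEOR, ROUGE-L, and Exact Match scores can be computed with nltk and rouge-score; the tokenisation, smoothing, and aggregation choices shown here are assumptions and need not match the evaluation settings used in this study.

\begin{verbatim}
# Illustrative per-prediction scoring; requires nltk.download("wordnet") once.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

def score_completion(reference: str, prediction: str) -> dict:
    ref_tokens, pred_tokens = reference.split(), prediction.split()
    bleu = sentence_bleu([ref_tokens], pred_tokens,
                         smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref_tokens], pred_tokens)
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, prediction)["rougeL"].fmeasure
    exact = float(reference.strip() == prediction.strip())
    return {"BLEU": bleu, "METEOR": meteor, "ROUGE-L": rouge_l, "ExactMatch": exact}

print(score_completion("return a + b;", "return a + b;"))
\end{verbatim}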
Our publicly available code enables reproducible assessment of multilingual data smells and lays the groundwork for cleaner, language-aware pre-training corpora and more robust multilingual code assistants.