Data hound: Analysing non-English data smells in large code datasets
B.M. Buzatu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Van Deursen – Graduation committee member (TU Delft - Software Engineering)
Maliheh Izadi – Graduation committee member (TU Delft - Software Engineering)
J. Katzy – Mentor (TU Delft - Software Engineering)
R.M. Popescu – Mentor (TU Delft - Software Engineering)
A. Anand – Graduation committee member (TU Delft - Web Information Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Large Language Models (LLMs) are increasingly used for code-centric tasks, yet their training data often exhibits data smells that may hinder downstream quality. This research focuses on the “Uneven Natural Languages” smell, i.e., the presence of non-English text in source code, and investigates its effect on LLM-based code generation and summarisation. We construct a three-stage pipeline (Detection, Generation, Evaluation) that annotates every character in a file with its predicted language using Tree-sitter, FastText, and pycld2, masks target spans via causal masking and Fill-in-the-Middle (FIM), and prompts three chosen models (SmolLM2, StarCoder 2, and Mellum-4B). The pipeline is applied to the Heap dataset, although this research focuses only on its Java subset.
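The two masking strategies mentioned above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: it assumes StarCoder-style FIM sentinel tokens and character-offset spans for the non-English element (here a Spanish comment word); the function names are hypothetical.

```python
# Hypothetical sketch of the two masking strategies: causal masking
# truncates the file at the target span, FIM exposes prefix and suffix
# around it. Sentinel tokens follow the StarCoder convention.

FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def causal_prompt(source: str, start: int) -> str:
    """Causal masking: keep everything before the target span;
    the model must continue from that point."""
    return source[:start]

def fim_prompt(source: str, start: int, end: int) -> str:
    """Fill-in-the-Middle: show prefix and suffix, ask the model
    to generate the masked middle span."""
    prefix, suffix = source[:start], source[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

code = "int x = 0; // contador\n"          # "contador" = Spanish for "counter"
start = code.index("contador")
end = start + len("contador")

print(causal_prompt(code, start))           # prefix-only prompt
print(fim_prompt(code, start, end))         # prefix + suffix prompt
```

In both cases the model's output is later compared against the masked span ("contador") during the Evaluation stage.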
Across 3.35 million Java files, we find that English tokens account for more than 90% of comments, strings, and identifiers, while Chinese, Spanish, Portuguese, and French form a long-tailed minority. Despite this skew, the LLMs achieve marginally higher BLEU, METEOR, ROUGE, and Exact Match scores when non-English elements are present or masked. Mellum-4B consistently yields the most fluent continuations; StarCoder 2 retains broader token recall; SmolLM2 lags on both axes, reflecting its smaller capacity.
Our publicly available code enables reproducible assessment of multilingual data smells and lays the groundwork for cleaner, language-aware pre-training corpora and more robust multilingual code assistants.