Data Hound: Analyzing Boilerplate Code Data Smell on Large Code Datasets

Bachelor Thesis (2025)
Author(s)

S.A. Minkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Arie Deursen – Mentor (TU Delft - Software Engineering)

Maliheh Izadi – Mentor (TU Delft - Software Engineering)

J.B. Katzy – Mentor (TU Delft - Software Engineering)

R.M. Popescu – Mentor (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2025
Language
English
Coordinates
4.3756361,52.0027516
Graduation Date
01-07-2025
Awarding Institution
Delft University of Technology
Project
['CSE3000 Research Project']
Programme
['Computer Science and Engineering']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As Large Language Models become an ever more integral part of Software Engineering, often assisting developers on coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to have an impact on a Large Language Model’s ability on such tasks [ 3]. Boilerplate code is considered to be a subcategory of said smells. In this paper, we investigate a specific type of this smell, boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance.

Files

Stefan_Thesis.pdf
(pdf | 0.727 Mb)
License info not available