Data Hound: Analyzing Boilerplate Code Data Smell on Large Code Datasets

Bachelor Thesis (2025)
Author(s)

S.A. Minkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. van Deursen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M. Izadi – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.B. Katzy – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

R.M. Popescu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
Publication Year
2025
Language
English
Coordinates
4.3756361,52.0027516
Graduation Date
01-07-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Downloads counter
133
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As Large Language Models (LLMs) become an increasingly integral part of Software Engineering, often assisting developers with coding tasks, the need for an unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to affect an LLM's performance on these tasks [3]. Boilerplate code is considered a subcategory of these smells. In this paper, we investigate one specific type of this smell: boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of LLMs on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance.
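
To make the idea of bias in reference-based evaluation concrete, the sketch below compares an average similarity score between samples flagged as boilerplate and samples that are not. It is an illustration only and is not taken from the thesis: the metric (a character-level ratio from Python's `difflib`), the function names `similarity` and `mean_score_by_group`, and the toy samples are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(prediction: str, reference: str) -> float:
    """Character-level similarity in [0, 1] between a prediction and its reference.

    Assumption: a stand-in for whichever reference-based metric is actually used.
    """
    return SequenceMatcher(None, prediction, reference).ratio()


def mean_score_by_group(samples):
    """Average the similarity separately for boilerplate and non-boilerplate samples.

    Each sample is a (prediction, reference, is_boilerplate) tuple.
    Groups with no samples are left out of the result.
    """
    groups = {True: [], False: []}
    for prediction, reference, is_boilerplate in samples:
        groups[is_boilerplate].append(similarity(prediction, reference))
    return {flag: mean(scores) for flag, scores in groups.items() if scores}


# Hypothetical toy data, invented purely for illustration.
samples = [
    ("import java.util.List;", "import java.util.List;", True),
    ("return a + b;", "return compute(a, b);", False),
]

scores = mean_score_by_group(samples)
print(scores)
```

If the boilerplate group scores consistently higher than the rest on a real dataset, an aggregate score over all samples would overstate how well the model handles non-boilerplate code, which is the kind of inflation the abstract describes.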

Files

Stefan_Thesis.pdf
(pdf | 0.727 Mb)
License info not available