Data Hound: Analyzing Boilerplate Code Data Smell on Large Code Datasets
S.A. Minkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Arie van Deursen – Mentor (TU Delft - Software Engineering)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
J.B. Katzy – Mentor (TU Delft - Software Engineering)
R.M. Popescu – Mentor (TU Delft - Software Engineering)
Abstract
As Large Language Models become an ever more integral part of Software Engineering, often assisting developers with coding tasks, the need for unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to affect a Large Language Model's performance on such tasks [3]. Boilerplate code is considered a subcategory of these smells. In this paper, we investigate a specific type of this smell: boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance.
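For illustration only: the abstract does not specify a detection method, but the following minimal sketch shows one plausible way to surface boilerplate API usage patterns, namely counting API call n-grams that recur verbatim across many files in a corpus. The regex, the n-gram length, the document-frequency threshold, and all function names are hypothetical assumptions, not the paper's actual pipeline.

```python
import re
from collections import Counter

# Hypothetical illustration: flag API call sequences that repeat
# verbatim across many files, a crude proxy for boilerplate API
# usage patterns. Regex and threshold are assumptions.
CALL_RE = re.compile(r"\b[A-Za-z_][\w.]*\s*\(")

def call_ngrams(source: str, n: int = 4) -> list[tuple[str, ...]]:
    """Extract n-grams over the sequence of API call names in one file."""
    calls = [m.group(0).rstrip("( \t") for m in CALL_RE.finditer(source)]
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def frequent_patterns(files: list[str], n: int = 4, min_files: int = 50):
    """Return call n-grams that appear in at least `min_files` files."""
    doc_freq = Counter()  # document frequency per n-gram
    for src in files:
        for gram in set(call_ngrams(src, n)):  # count each file once
            doc_freq[gram] += 1
    return {gram: c for gram, c in doc_freq.items() if c >= min_files}
```

Under these assumptions, a sequence flagged this way (for example, a logging or connection-setup call chain recurring across thousands of files) would be a candidate boilerplate instance whose presence in evaluation references could inflate measured model performance.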