Data Hound: Analyzing Boilerplate Code Data Smell on Large Code Datasets
S.A. Minkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Arie van Deursen – Mentor (TU Delft - Software Engineering)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
J.B. Katzy – Mentor (TU Delft - Software Engineering)
R.M. Popescu – Mentor (TU Delft - Software Engineering)
Abstract
As Large Language Models become an ever more integral part of Software Engineering, often assisting developers with coding tasks, the need for unbiased evaluation of their performance on such tasks grows [1]. Data smells [2] are reported to affect a Large Language Model's performance on such tasks [3]. Boilerplate code is considered a subcategory of these smells. In this paper, we investigate a specific type of this smell: boilerplate API usage patterns. We analyze their prevalence in The Heap dataset [1] and examine how they may bias reference-based evaluation of Large Language Models on code generation tasks. Our findings show that while this data smell is relatively rare, instances containing it are significantly easier for LLMs to predict. We attribute this to partial memorization of common boilerplate patterns, which inflates perceived model performance.
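For illustration only: the abstract does not specify a detection method, but the following minimal sketch shows one plausible way to surface boilerplate API usage patterns, namely counting API call n-grams that recur verbatim across many files in a corpus. The regex, the n-gram length, the document-frequency threshold, and all function names are hypothetical assumptions, not the paper's actual pipeline.

```python
import re
from collections import Counter

# Hypothetical illustration: flag API call sequences that repeat
# verbatim across many files, a crude proxy for boilerplate API
# usage patterns. Regex and threshold are assumptions.
CALL_RE = re.compile(r"\b[A-Za-z_][\w.]*\s*\(")

def call_ngrams(source: str, n: int = 4) -> list[tuple[str, ...]]:
    """Extract n-grams over the sequence of API call names in one file."""
    calls = [m.group(0).rstrip("( \t") for m in CALL_RE.finditer(source)]
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]

def frequent_patterns(files: list[str], n: int = 4, min_files: int = 50):
    """Return call n-grams that appear in at least `min_files` files."""
    doc_freq = Counter()  # document frequency per n-gram
    for src in files:
        for gram in set(call_ngrams(src, n)):  # count each file once
            doc_freq[gram] += 1
    return {gram: c for gram, c in doc_freq.items() if c >= min_files}
```

Under these assumptions, a sequence flagged this way (for example, a logging or connection-setup call chain recurring across thousands of files) would be a candidate boilerplate instance whose presence in evaluation references could inflate measured model performance.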