Data Smells in Public Datasets

Conference Paper (2022)
Author(s)

Arumoy Shome (TU Delft - Software Engineering)

Luis Cruz (TU Delft - Software Engineering)

Arie van Deursen (TU Delft - Software Technology)

DOI related publication
https://doi.org/10.1145/3522664.3528621 Final published version
Publication Year
2022
Language
English
Pages (from-to)
205-216
ISBN (electronic)
978-1-4503-9275-4
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and the criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling data, yet tools to aid them with data analysis are lacking. This study identifies recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.

Files

Data_Smells_in_Public_Datasets... (pdf)
(pdf | 0.621 MB)
- Embargo expired on 27-07-2022
License info not available