Data Smells in Public Datasets

Conference Paper (2022)
Author(s)

Arumoy Shome (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Luis Cruz (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Arie van Deursen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group
Software Engineering
DOI related publication
https://doi.org/10.1145/3522664.3528621 Final published version
Publication Year
2022
Language
English
Pages (from-to)
205-216
ISBN (electronic)
978-1-4503-9275-4
Event
1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022 (2022-05-16 - 2022-05-17), Pittsburgh, United States
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and the criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling data, yet tools to aid them with data analysis are lacking. This study identifies recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.

Files

Data_Smells_in_Public_Datasets... (pdf)
(pdf | 0.621 MB)
- Embargo expired on 27-07-2022
License info not available