Data Smells in Public Datasets

Conference Paper (2022)

Author(s)

Arumoy Shome (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Luis Cruz (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Arie Van Deursen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Software Engineering

Data quality, Code smells, AI engineering, Data smells

DOI related publication

https://doi.org/10.1145/3522664.3528621 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:88a2eb26-cc21-4213-88db-406dfcad8f3b

Publication Year

2022

Language

English

Pages (from-to)

205-216

ISBN (electronic)

978-1-4503-9275-4

Event

1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022 (2022-05-16 - 2022-05-17), Pittsburgh, United States

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and criminal justice calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling data, yet tools to aid them with data analysis are lacking. This study identifies recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.

Files

Data_Smells_in_Public_Datasets... (pdf)

(pdf | 0.621 Mb)

- Embargo expired on 27-07-2022

License info not available