Title: Data Smells in Public Datasets
Author: Shome, A. (TU Delft Software Engineering); Cruz, Luis (TU Delft Software Engineering); van Deursen, A. (TU Delft Software Technology)
Department: Software Technology
Date: 2022
Abstract: The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling the data, yet tools to aid them with data analysis are lacking. This study identifies the recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.
Subject: ai engineering; code smells; data quality; data smells
To reference this document use: http://resolver.tudelft.nl/uuid:88a2eb26-cc21-4213-88db-406dfcad8f3b
DOI: https://doi.org/10.1145/3522664.3528621
Publisher: IEEE
Embargo date: 2022-07-27
ISBN: 978-1-4503-9275-4
Source: Proceedings - 1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022
Event: 1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022, 2022-05-16 → 2022-05-17, Pittsburgh, United States
Series: Proceedings - 1st International Conference on AI Engineering - Software Engineering for AI, CAIN 2022
Bibliographical note: Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
Part of collection: Institutional Repository
Document type: conference paper
Rights: © 2022 A. Shome, Luis Cruz, A. van Deursen
Files: Data_Smells_in_Public_Datasets.pdf (PDF, 635.99 KB)