Dataset quality within a societally impactful machine learning domain

An overview of data collection and annotation practices of the datasets used by papers published by the ACL

Bachelor Thesis (2025)
Author(s)

A. Fazakas (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.C.S. Liem – Mentor (TU Delft - Multimedia Computing)

J. Yang – Graduation committee member (TU Delft - Web Information Systems)

A.M. Demetriou – Mentor (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This study gives an overview of the data collection and annotation practices of the datasets used by the most impactful papers published by the Association for Computational Linguistics (ACL). The most highly cited papers in the ACL Anthology were selected across three periods (the past 2, 5, and 15 years). The datasets used by those papers were then extracted and filtered to retain the most impactful ones. Finally, a carefully crafted annotation schema was applied to capture key aspects of each dataset and analyze them qualitatively. The analysis yielded three main findings. (1) Fewer datasets are used on average by papers from the past 2 years, and these overlap little with the datasets used by papers from the past 5 or 15 years. (2) Several of the key aspects raise concerns, such as the relatively high (∼36%) and unregulated use of the Amazon Mechanical Turk crowdsourcing platform for dataset construction. Information is also frequently missing on labeller population, prescreening, inter-rater reliability, and the rationale for sample size (missing ∼77%, ∼63%, ∼19-56%, and ∼81% of the time, respectively), although reporting practices for most of these issues have slightly improved among datasets used in the past 2 years. (3) Overall, around one third of the information sought was missing across all periods; here too the state of the domain has generally been improving, with just under one fourth of the information missing from datasets used in the past 2 years. Recommendations are given to address these challenges, the most important being that each academic organization should require submissions to include a reporting template.

Files

Research_Project_1_.pdf
(pdf | 1.77 Mb)
License info not available