Dataset quality within a societally impactful machine learning domain

An overview of data collection and annotation practices of the datasets used by papers published by the ACL

Bachelor Thesis (2025)
Author(s)

A. Fazakas (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.C.S. Liem – Mentor (TU Delft - Multimedia Computing)

J. Yang – Graduation committee member (TU Delft - Web Information Systems)

A.M. Demetriou – Mentor (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This study gives an overview of the data collection and annotation practices of the datasets used by the most impactful papers published by the Association for Computational Linguistics (ACL). The most highly cited papers in the ACL Anthology were selected across three periods (the past 2, 5, and 15 years). The datasets used by those papers were then extracted and filtered to retain the most impactful ones. Finally, a carefully crafted annotation schema was applied to capture key aspects of each dataset and analyze them qualitatively. The analysis yielded three main findings. (1) Fewer datasets are used on average by papers from the past 2 years, and these overlap little with the datasets used by papers from the past 5 or 15 years. (2) Several of the key aspects raise concerns, such as the relatively high (∼36%) and unregulated use of the Amazon Mechanical Turk crowdsourcing platform for dataset construction. Information is also frequently missing on labeller population, prescreening, inter-rater reliability, and the rationale for sample size (missing ∼77%, ∼63%, ∼19-56%, and ∼81% of the time, respectively), although reporting practices for most of these issues have slightly improved among datasets used in the past 2 years. (3) Overall, around one third of the information sought was missing across all periods; here too the state of the domain has generally been improving, with just under one fourth of the information missing from datasets used in the past 2 years. Recommendations are given to address these challenges, the most important being that each academic organization should require submissions to include a reporting template.

Files

Research_Project_1_.pdf
(pdf | 1.77 Mb)
License info not available