A. Fazakas




Dataset quality within a societally impactful machine learning domain

An overview of data collection and annotation practices of the datasets used by papers published by the ACL

Bachelor thesis (2025) - A. Fazakas, C.C.S. Liem, J. Yang, A.M. Demetriou

This study gives an overview of the data collection and annotation practices of the datasets used by the most impactful papers published by the Association for Computational Linguistics (ACL). To this end, the most highly cited papers in the ACL Anthology were selected across three periods (published in the past 2, 5, and 15 years). The datasets used by those papers were then extracted and filtered to retain the most impactful ones. Finally, a carefully crafted annotation schema was used to record key aspects of each dataset so that the datasets could be analyzed qualitatively. The analysis yielded three findings. (1) Fewer datasets are used on average by papers from the past 2 years, and there is little overlap with the datasets used by papers published in the past 5 or 15 years. (2) There are various concerns related to those key aspects, such as the relatively high (∼36%) and unregulated use of the Amazon Mechanical Turk crowdsourcing platform for the construction of datasets. Another concern is that information is frequently missing about the rationale for the labeller population, prescreening, inter-rater reliability, and the rationale for the sample size (missing ∼77%, ∼63%, ∼19-56%, and ∼81% of the time, respectively). However, reporting practices for most of those issues have slightly improved within datasets used in the past 2 years. (3) Around one third of the information sought was missing across all periods. However, the state of the domain has been generally improving, with a lower share of about one fourth of the information missing from datasets used in the past 2 years. Some recommendations are given to overcome those challenges, the most important being that each academic organization should require submitted papers to include a reporting template. ...