Annotation Practices in Societally Impactful Machine Learning Applications
What are these automated systems actually trained on?
S. Lupșa (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A.M. Demetriou – Mentor (TU Delft - Multimedia Computing)
Cynthia C. S. Liem – Mentor (TU Delft - Multimedia Computing)
J. Yang – Graduation committee member (TU Delft - Web Information Systems)
Abstract
This study examines dataset annotation practices in influential NeurIPS research. Datasets employed in highly cited NeurIPS papers were assessed against criteria concerning their item population, labelling schema, and annotation process. While high-level information, such as the involvement of human labellers and the item population, is reported in most cases, procedural details of the annotation process are poorly documented. Notably, 48% of datasets lack details on annotator training, 43% omit inter-rater reliability, and 28% are not publicly accessible. Temporal comparisons show minor improvements, but no substantial progress in reporting annotation methodology. A complementary analysis of 49 NeurIPS papers published since 2020 shows that researchers often discuss the broader impact of their work, yet do not include datasets or their annotations in these assessments. These findings highlight a lack of standardisation in annotation reporting and call for more robust practices that ensure transparency, auditability, and reproducibility in machine learning research.