Characterising and Mitigating Aggregation-Bias in Crowdsourced Toxicity Annotations

Conference Paper (2018)
Authors

Agathe Balayn (Student TU Delft, IBM Nederland)

Panagiotis Mavridis (TU Delft - Web Information Systems)

A. Bozzon (TU Delft - Web Information Systems)

B.F.L. Timmermans (IBM Nederland)

Zoltán Szlávik (IBM Nederland)

Research Group
Web Information Systems
Copyright
© 2018 A.M.A. Balayn, P. Mavridis, A. Bozzon, B.F.L. Timmermans, Z. Szlávik
Publication Year
2018
Language
English
Volume number
2276
Pages (from-to)
67-71
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Training machine learning (ML) models for natural language processing usually requires large amounts of data, often acquired through crowdsourcing. The way these data are collected and aggregated can affect the outputs of the trained model, for instance by discarding labels that differ from the majority. In this paper we investigate how label aggregation can bias ML results towards certain data samples, and we propose a methodology to highlight and mitigate this bias. Although our work is applicable to any kind of label aggregation for data subject to multiple interpretations, we focus on the bias introduced by majority voting in toxicity prediction over sentences. Our preliminary results indicate that we can mitigate the majority bias and obtain higher prediction accuracy for minority opinions if we take the annotators' individual labels into account when training adapted models, rather than relying on the aggregated labels.
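To make the contrast described in the abstract concrete, here is a minimal, hypothetical Python sketch (not the authors' code; the sentences and labels are invented) contrasting majority-vote aggregation, which discards dissenting toxicity labels, with a disaggregated alternative that keeps each annotator's label as a separate training example:

```python
from collections import Counter

# Hypothetical crowdsourced annotations: each sentence was labelled
# by several annotators as toxic (1) or non-toxic (0).
annotations = {
    "you are an idiot": [1, 1, 1],
    "that argument is rubbish": [1, 0, 0],  # one annotator finds it toxic
    "have a nice day": [0, 0, 0],
}

# Majority-vote aggregation: one label per sentence; the dissenting
# vote on the second sentence is discarded entirely.
aggregated = {
    sentence: Counter(labels).most_common(1)[0][0]
    for sentence, labels in annotations.items()
}
print(aggregated)

# Disaggregated alternative: keep every (sentence, label) pair so a
# model trained on these examples also sees the minority opinion.
disaggregated = [
    (sentence, label)
    for sentence, labels in annotations.items()
    for label in labels
]
print(disaggregated)
```

Training on the disaggregated pairs exposes the model to minority opinions that majority voting would erase, which is the effect the paper sets out to measure and mitigate.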

Files

Paper7.pdf
(PDF | 0.392 MB)
License info not available