Mitigating bias against non-native accents

Journal Article (2022)
Author(s)

Yuanyuan Zhang (TU Delft - Multimedia Computing)

Yixuan Zhang (Student TU Delft)

Bence M. Halpern (Nederlands Kanker Instituut - Antoni van Leeuwenhoek ziekenhuis, TU Delft - Multimedia Computing, Universiteit van Amsterdam)

T.B. Patel (TU Delft - Multimedia Computing)

Odette Scharenborg (TU Delft - Multimedia Computing)

Multimedia Computing
Copyright
© 2022 Y. Zhang, Yixuan Zhang, B.M. Halpern, T.B. Patel, O.E. Scharenborg
DOI related publication
https://doi.org/10.21437/Interspeech.2022-836
Publication Year
2022
Language
English
Volume number
2022-September
Pages (from-to)
3168-3172
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic speech recognition (ASR) systems have seen substantial improvements in the past decade; however, not for all speaker groups. Recent research shows that bias exists against different types of speech, including non-native accents, in state-of-the-art (SOTA) ASR systems. To attain inclusive speech recognition, i.e., ASR for everyone irrespective of how one speaks or the accent one has, bias mitigation is necessary. Here we focus on mitigating bias against non-native accents using two different approaches: data augmentation and more effective training methods. We used an autoencoder-based cross-lingual voice conversion (VC) model to increase the amount of non-native-accented speech training data, in addition to data augmentation through speed perturbation. Moreover, we investigated two training methods, i.e., fine-tuning and domain adversarial training (DAT), to see whether they can use the limited non-native-accented speech data more effectively than a standard training approach. Experimental results show that VC-based data augmentation successfully mitigates the bias against non-native accents for the SOTA end-to-end (E2E) Dutch ASR system. Combining VC and speed-perturbed data gave the lowest word error rate (WER) and the smallest bias against non-native accents. Fine-tuning and DAT reduced the bias against non-native accents, but at the cost of native performance.
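The abstract mentions domain adversarial training (DAT) as one of the training methods, without architectural details. Below is a minimal, hypothetical PyTorch sketch of the generic DAT idea: an auxiliary accent classifier attached to ASR encoder states through a gradient reversal layer, so that the encoder is pushed towards accent-invariant representations. All names (GradReverse, AccentAdversarialHead, enc_dim, adv_weight) are illustrative assumptions, not the setup used in the paper.

```python
import torch
from torch import nn
from torch.autograd import Function


class GradReverse(Function):
    """Identity in the forward pass; reverses (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.lambd * grad_output, None


class AccentAdversarialHead(nn.Module):
    """Hypothetical accent-group classifier on top of pooled ASR encoder states (illustrative only)."""

    def __init__(self, enc_dim: int, n_groups: int = 2, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(enc_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_groups),
        )

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        # enc_states: (batch, time, enc_dim); mean-pool over time, reverse gradients, classify accent group.
        pooled = enc_states.mean(dim=1)
        reversed_feats = GradReverse.apply(pooled, self.lambd)
        return self.classifier(reversed_feats)


# Sketch of the combined objective (adv_weight is an assumed hyperparameter):
# accent_logits = adversarial_head(encoder_states)
# total_loss = asr_loss + adv_weight * nn.functional.cross_entropy(accent_logits, accent_labels)
```

Because the gradient is reversed only on the path into the encoder, the accent classifier still learns to predict the accent group while the encoder is trained to make that prediction harder, which is the mechanism DAT relies on to reduce accent-specific cues in the shared representation.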

Files

Zhang22n_interspeech.pdf
(pdf | 0.506 MB)
- Embargo expired on 01-07-2023
License info not available