Speech recognition performance disparities between Dutch diverse speaker groups

Journal Article (2026)

Author(s)

Yuanyuan Zhang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Thomas De Valck (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Odette Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Multimedia Computing

Keywords

Automatic speech recognition; Dysarthric speech; Non-native accents; Performance disparities; Dutch diverse speech

DOI related publication

https://doi.org/10.1515/phon-2025-0061 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:69ea0993-4f84-40e0-a382-fe4de51282e4

Publication Year

2026

Language

English

Journal title

Phonetica

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Current state-of-the-art automatic speech recognition (ASR) systems recognize typical speech (very) well. However, recent research has shown that their performance degrades for “diverse” speech, i.e., speech that diverges from “typical” speech due to, among other things, demographic and sociolinguistic factors. Given the rapid development of ASR technologies, we examined the performance on Dutch diverse speech of nine recently released ASR systems developed by Google, Microsoft, Meta, NVIDIA, and OpenAI, and of three custom ASR models trained from scratch. Our results showed that although overall recognition results differ substantially between systems, all systems show similar patterns in recognition performance across diverse speaker groups. For most ASR systems and models, language proficiency differences and severe speech motor impairment had a greater impact on performance disparities between speaker groups than demographic or sociolinguistic factors. This indicates that acoustic variability due to demographic and sociolinguistic factors is well represented in “typical speech” training data and consequently is modeled well. Furthermore, we found that differences in data processing pipelines and decoding setups significantly influenced recognition performance. Importantly, updates to company-developed ASR systems do not always improve performance for diverse speaker groups or reduce the performance disparities between them.

Files

10.1515_phon-2025-0061.pdf

(pdf | 2.36 Mb)