Speech recognition performance disparities between Dutch diverse speaker groups

Journal Article (2026)
Author(s)

Yuanyuan Zhang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Thomas De Valck (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Odette Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group
Multimedia Computing
DOI related publication
https://doi.org/10.1515/phon-2025-0061 Final published version
Publication Year
2026
Language
English
Journal title
Phonetica
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Current state-of-the-art automatic speech recognition (ASR) systems recognize typical speech very well. However, recent research has shown that their performance degrades for “diverse” speech, i.e., speech that diverges from “typical” speech due to, among other things, demographic and sociolinguistic factors. Given the rapid development of ASR technologies, in this work we examined the performance on Dutch diverse speech of nine recently released ASR systems developed by Google, Microsoft, Meta, NVIDIA, and OpenAI, and of three custom ASR models trained from scratch. Our results showed that although overall recognition results differ substantially between systems, all systems show similar patterns in recognition performance across diverse speaker groups: for most ASR systems and models, language proficiency differences and severe speech motor impairment had a greater impact on performance disparities between speaker groups than demographic or sociolinguistic factors. This indicates that acoustic variability due to demographic and sociolinguistic factors is well represented in “typical speech” training data and is consequently well modeled. Furthermore, we found that differences in data processing pipelines and decoding setups significantly influenced recognition performance. Importantly, updates to company-developed ASR systems do not always improve performance for, or reduce performance disparities between, diverse speaker groups.
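
As an illustration of what a "performance disparity between speaker groups" can mean in practice, the sketch below computes a word error rate (WER) per speaker group and the gap between two groups. The abstract does not name the metric or the evaluation code used, so WER is an assumption here (it is a common ASR metric), and the function names, group labels, and toy utterances are all hypothetical placeholders rather than data or results from the article.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(utterances):
    """Aggregate WER per group from (group, reference, hypothesis) triples."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, reference, hypothesis in utterances:
        n_errors, n_words = word_errors(reference, hypothesis)
        errors[group] += n_errors
        words[group] += n_words
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Invented toy data (hypothetical group labels, not taken from the article).
toy = [
    ("native_adult", "de kat zit op de mat", "de kat zit op de mat"),
    ("non_native", "de kat zit op de mat", "de kat zit de mat"),
    ("non_native", "ik ga naar huis", "ik ga na huis"),
]

rates = wer_by_group(toy)
gap = rates["non_native"] - rates["native_adult"]
print(rates, gap)
```

Errors and word counts are pooled per group before dividing, so long utterances weigh more than short ones; averaging per-utterance WER instead is an equally defensible choice that can give different numbers.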