Y. Zhang

7 records found

Current state-of-the-art automatic speech recognition (ASR) systems recognize typical speech very well. However, recent research has shown that their performance degrades for “diverse” speech, i.e., speech that diverges from “typical” speech due to, among other things, demographic and sociolinguistic factors. In this work, given the rapid development of ASR technologies, we examined the performance on Dutch diverse speech of nine recently released ASR systems developed by Google, Microsoft, Meta, NVIDIA, and OpenAI, and of three custom ASR models trained from scratch. Our results showed that although overall recognition results differ quite substantially between the systems, all systems show similar patterns in recognition performance across diverse speaker groups: for most ASR systems and models, language proficiency differences and severe speech motor impairment had a greater impact on performance disparities between speaker groups than demographic or sociolinguistic factors. This indicates that acoustic variability due to demographic and sociolinguistic factors is well represented in “typical speech” training data and is consequently well modeled. Furthermore, we found that differences in data processing pipelines and decoding setups significantly influenced recognition performance. Importantly, updates to company-developed ASR systems do not always improve performance for, or reduce performance disparities between, diverse speaker groups. ...
Conference paper (2025) - Z. Yue, Y. Zhang
Automatic dysarthric speech recognition (ADSR) remains challenging due to the irregularities in speech caused by motor control impairments and the limited availability of dysarthric speech data. This paper explores the integration of articulatory features, captured using Electromagnetic Articulography (EMA), with both conventional acoustic features and features extracted from large-scale pretrained models, including Whisper and XLSR-53, as well as from a fine-tuned Whisper model. We propose end-to-end (E2E) Conformer-based acoustic-articulatory models for ADSR and compare their performance against the corresponding hybrid TDNNF models. The experimental results show that fusing the fine-tuned Whisper features (Whisper-FT) with articulatory features achieves the lowest word error rate (WER) on dysarthric speech, 10.5%, with particularly significant improvements for severely dysarthric speech, reaching a WER of 20.8%. ...
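The figures reported above are word error rates. As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), assuming plain whitespace tokenisation and no text normalisation; the function name and the example sentences are illustrative and are not taken from the paper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: one substitution and one deletion over six words.
wer = word_error_rate("de kat zit op de mat", "de kat zat op mat")
```

Evaluation toolkits usually normalise casing and punctuation before scoring, so absolute numbers from this sketch are not comparable with published results.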
Conference paper (2025) - Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, Yuanyuan Zhang, Elke De Witte, Odette Scharenborg, More authors...
Speech technologies have advanced significantly, yet they remain largely trained on typical speech, limiting their applicability to individuals with speech and language impairments. A key obstacle is the lack of well-annotated and representative atypical speech corpora. This paper presents a multi-project survey and shares first-hand experience of the challenges of collecting, annotating, using, and sharing atypical speech data. Experiences from seven research projects on collecting atypical speech data, involving both academic and clinical perspectives, are reported and potential issues are discussed. Furthermore, the paper provides practical guidelines for standardising and harmonising data collection practices. Such standardisation allows studies to be compared, replicated, and validated, which is essential for developing more inclusive and effective speech technologies. ...
Conference paper (2024) - Y. Zhang, Z. Yue, T.B. Patel, O.E. Scharenborg
State-of-the-art ASR systems show suboptimal performance for child speech. The scarcity of child speech data limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced WERs by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best FT models. ...
Conference paper (2023) - YuanYuan Zhang, Aaricia Herygers, Tanvina Patel, Zhengjun Yue, Odette Scharenborg
Automatic speech recognition (ASR) should serve every speaker, not only the majority “standard” speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a “non-standard” or “diverse” way is crucial. We aim to mitigate the bias against non-native-accented Flemish in a Flemish ASR system. Since this is a low-resource problem, we investigate the optimal type of data augmentation, i.e., speed/pitch perturbation, cross-lingual voice conversion-based methods, and SpecAugment, applied to both native Flemish and non-native-accented Flemish, for bias mitigation. The results showed that specific types of data augmentation applied to both native and non-native-accented speech improve non-native-accented ASR while applying data augmentation to the non-native-accented speech is more conducive to bias reduction. Combining both gave the largest bias reduction for human-machine interaction (HMI) as well as read-type speech. ...
Conference paper (2022) - Yixuan Zhang, Y. Zhang, T.B. Patel, O.E. Scharenborg
One important problem that needs tackling for wide deployment of Automatic Speech Recognition (ASR) is the bias in ASR, i.e., ASR systems tend to generate more accurate predictions for certain speaker groups while making more errors on speech from other groups. We aim to reduce bias against non-native speakers of Dutch compared to native Dutch speakers. We investigate three different data augmentation techniques (speed perturbation, volume perturbation, and pitch shift) to increase the amount of non-native accented Dutch training data, and use the augmented data with two transfer learning techniques, model fine-tuning and multi-task learning, to reduce bias in a state-of-the-art hybrid HMM-DNN Kaldi-based ASR system. Experimental results on Dutch read speech and human-machine interaction (HMI) speech showed that although individual data augmentation techniques did not always yield improved recognition performance, the combination of all three did. Importantly, bias was reduced by more than 18% absolute compared to the baseline system for read speech when applying pitch shift and multi-task training, and by more than 7% for HMI speech when applying all three data augmentation techniques during fine-tuning, while improving recognition accuracy of both native and non-native Dutch speech. ...
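Two of the augmentation techniques named above, speed and volume perturbation, can be illustrated with a minimal sketch. It assumes audio is held as a list of float samples and uses plain linear interpolation; the function names, the factors 0.9 and 1.1, and the gain value are assumptions for illustration, not the settings or tooling used in the paper. Pitch shift is not covered here.

```python
import math


def speed_perturb(samples: list[float], factor: float) -> list[float]:
    """Resample so playback is `factor` times faster (factor > 1 shortens)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for k in range(n_out):
        pos = k * factor                 # fractional position in the input
        i = int(math.floor(pos))
        frac = pos - i
        if i + 1 < len(samples):
            # Linear interpolation between neighbouring input samples.
            out.append((1.0 - frac) * samples[i] + frac * samples[i + 1])
        else:
            # Past the last full interval: hold the final sample.
            out.append(samples[min(i, len(samples) - 1)])
    return out


def volume_perturb(samples: list[float], gain: float) -> list[float]:
    """Scale the amplitude of every sample by `gain`."""
    return [s * gain for s in samples]


# Hypothetical one-second 440 Hz tone at 16 kHz, augmented three ways.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
augmented = [
    speed_perturb(tone, 0.9),   # slower, longer
    speed_perturb(tone, 1.1),   # faster, shorter
    volume_perturb(tone, 0.5),  # quieter
]
```

Because this kind of resampling changes duration and pitch together, it differs from a pure tempo change; production pipelines typically rely on a dedicated audio tool with proper anti-aliasing rather than linear interpolation.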
Automatic speech recognition (ASR) systems have seen substantial improvements in the past decade; however, not for all speaker groups. Recent research shows that bias exists against different types of speech, including non-native accents, in state-of-the-art (SOTA) ASR systems. To attain inclusive speech recognition, i.e., ASR for everyone irrespective of how one speaks or the accent one has, bias mitigation is necessary. Here we focus on bias mitigation against non-native accents using two different approaches: data augmentation and more effective training methods. We used an autoencoder-based cross-lingual voice conversion (VC) model to increase the amount of non-native accented speech training data, in addition to data augmentation through speed perturbation. Moreover, we investigate two training methods, i.e., fine-tuning and domain adversarial training (DAT), to see whether they can use the limited non-native accented speech data more effectively than a standard training approach. Experimental results show that VC-based data augmentation successfully mitigates the bias against non-native accents for the SOTA end-to-end (E2E) Dutch ASR system. Combining VC and speed-perturbed data gave the lowest word error rate (WER) and the smallest bias against non-native accents. Fine-tuning and DAT reduced the bias against non-native accents but at the cost of native performance. ...
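The abstract does not spell out how bias is quantified. One common reading, sketched below purely as an assumption, treats bias as the absolute WER gap between a speaker group and a reference group. The class name, function name, and all counts are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupResult:
    """Pooled recognition errors for one speaker group."""
    errors: int      # substitutions + deletions + insertions
    ref_words: int   # total words in the reference transcripts

    @property
    def wer(self) -> float:
        """Word error rate in percent, pooled over the whole group."""
        if self.ref_words <= 0:
            raise ValueError("ref_words must be positive")
        return 100.0 * self.errors / self.ref_words


def wer_gap(group: GroupResult, reference_group: GroupResult) -> float:
    """Assumed bias measure: WER(group) - WER(reference_group), in % absolute."""
    return group.wer - reference_group.wer


# Hypothetical counts, for illustration only.
native = GroupResult(errors=100, ref_words=1000)
non_native = GroupResult(errors=300, ref_words=1000)
gap = wer_gap(non_native, native)
```

Under this reading, a training method can shrink the gap either by improving the non-native group or by degrading the native group, which is why the native WER is worth reporting alongside the gap.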