T.B. Patel | TU Delft Repository

Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech

Conference paper (2025) - Dimme de Groot, Tanvina Patel, Devendra Kayande, Odette Scharenborg, Zhengjun Yue

Dysarthric speech poses significant challenges for automatic speech recognition (ASR) systems due to its high variability and reduced intelligibility. In this work we explore the use of diffusion models for dysarthric speech enhancement. The underlying hypothesis is that diffusion-based speech enhancement moves the distribution of dysarthric speech closer to that of typical speech, which could improve dysarthric speech recognition performance. We assess the effect of two diffusion-based speech enhancement algorithms and one signal-processing-based algorithm on the intelligibility and speech quality of two English dysarthric speech corpora. We applied speech enhancement to both typical and dysarthric speech and evaluated the ASR performance using Whisper-Turbo, as well as the subjective and objective speech quality of the original and enhanced dysarthric speech. We also fine-tuned Whisper-Turbo on the enhanced speech to assess its impact on recognition performance. ...

Challenges and practical guidelines for atypical speech data collection, annotation, usage and sharing

A multi-project perspective

Conference paper (2025) - Zhengjun Yue, Mara Barberis, Tanvina Patel, Judith Dineley, Willemijn Doedens, Lottie Stipdonk, Yuanyuan Zhang, Elke De Witte, Odette Scharenborg, More authors...

Speech technologies have advanced significantly, yet they remain largely trained on typical speech, limiting their applicability to individuals with speech and language impairments. A key obstacle is the lack of well-annotated and representative atypical speech corpora. This paper conducts a multi-project survey and shares first-hand experience of the challenges of collecting, annotating, using, and sharing atypical speech data. Experiences from seven research projects on collecting atypical speech data, involving both academic and clinical perspectives, are reported and potential issues are discussed. Furthermore, the paper provides practical guidelines that allow for standardisation and harmonisation of data collection practices. Such harmonisation is crucial for studies to be compared, replicated, and validated, which in turn is essential for developing more inclusive and effective speech technologies. ...

Improving child speech recognition with augmented child-like speech

Conference paper (2024) - Y. Zhang, Z. Yue, T.B. Patel, O.E. Scharenborg

State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) child speakers via monolingual and cross-lingual (Dutch-to-German) VC, respectively. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on the impact of the quantity of child-to-child cross-lingual VC-generated data on fine-tuning (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer model and FT-Whisper model, which reduced WERs by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by an absolute 3.6% WER. Moreover, using a small amount of "high-quality" VC-generated data achieved similar results to those of our best FT models. ...

Using articulated speech EEG signals for imagined speech decoding

Conference paper (2024) - Chris Bras, Tanvina Patel, Odette Scharenborg

Brain-Computer Interfaces (BCIs) open avenues for communication among individuals unable to use voice or gestures. Silent speech interfaces are one such approach for BCIs that could offer a transformative means of connecting with the external world. Performance on imagined speech decoding, however, is rather low due to, among other factors, data scarcity and the lack of a clear start and end point of the imagined speech in the brain signal. We investigate whether electroencephalography (EEG) signals from articulated speech can improve imagined speech decoding in two ways: by using articulated speech EEG signals to predict the end point of the imagined speech, and by using the articulated speech EEG as extra training data for speaker-independent imagined vowel classification. Our results show that using EEG data from articulated speech did not improve classification of vowels in imagined speech, probably due to high variability in EEG signals amongst speakers. ...

Improving End-to-End Models for Children’s Speech Recognition

Journal article (2024) - T.B. Patel, O.E. Scharenborg

Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of available annotated children’s speech data. We aim to improve CSR in the often-occurring scenario that no children’s speech data is available for training the Automatic Speech Recognition (ASR) systems. Traditionally, Vocal Tract Length Normalization (VTLN) has been widely used in hybrid ASR systems to address acoustic mismatch and variability in children’s speech when training models on adults’ speech. Meanwhile, End-to-End (E2E) systems often use data augmentation methods to create child-like speech from adults’ speech. For adult speech-trained ASRs, we investigate the effectiveness of two augmentation methods, speed perturbations and spectral augmentation, along with VTLN, in an E2E framework for the CSR task, comparing these across Dutch, German, and Mandarin. We applied VTLN at different stages (training/test) of the ASR and conducted age and gender analyses. Our experiments showed highly similar patterns across the languages: speed perturbations and spectral augmentation yielded significant performance improvements, and VTLN provided further improvements while maintaining recognition performance on adults’ speech (depending on when it is applied). Additionally, VTLN showed performance improvement for both male and female speakers and was particularly effective for younger children. ...
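As an illustration of the speed perturbation mentioned in this abstract, the following is a minimal sketch that resamples a waveform by linear interpolation, so that playing the result at the original sampling rate changes both tempo and pitch. It is not the authors' implementation: the function name `speed_perturb`, the pure-Python interpolation, and the factors 0.9/1.0/1.1 are assumptions chosen for illustration, and the abstract does not state which factors or tools were used.

```python
from typing import List, Sequence


def speed_perturb(samples: Sequence[float], factor: float) -> List[float]:
    """Resample `samples` so the signal plays `factor` times faster.

    Hypothetical helper: a factor above 1 shortens the signal and a factor
    below 1 lengthens it. Values between input samples are obtained by
    linear interpolation.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []

    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * factor              # fractional read position in the input
        i = min(int(pos), last)       # left neighbour, clamped to the last sample
        frac = pos - i
        right = samples[min(i + 1, last)]
        out.append((1.0 - frac) * samples[i] + frac * right)
    return out


# Assumed three-way augmentation: slower, unchanged, and faster copies.
ramp = [float(v) for v in range(10)]
augmented = {f: speed_perturb(ramp, f) for f in (0.9, 1.0, 1.1)}
```

Production toolkits typically use a band-limited resampler rather than linear interpolation; the sketch only shows how the length and time axis of the signal change.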

Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation

Conference paper (2023) - Zhaofeng Lin, Tanvina Patel, Odette Scharenborg

Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To address the data scarcity issue, we use a signal processing-based technique that transforms the spectral characteristics of normal speech to those of pseudo-whispered speech. We augment an End-to-End ASR with pseudo-whispered speech and achieve an 18.2% relative reduction in word error rate for whispered speech compared to the baseline. Results for the individual speaker groups in the wTIMIT database show the best results for US English. Further investigation showed that the lack of glottal information in whispered speech has the largest impact on whispered speech ASR performance. ...

Exploring Data Augmentation in Bias Mitigation Against Non-Native-Accented Speech

Conference paper (2023) - YuanYuan Zhang, Aaricia Herygers, Tanvina Patel, Zhengjun Yue, Odette Scharenborg

Automatic speech recognition (ASR) should serve every speaker, not only the majority “standard” speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a “non-standard” or “diverse” way is crucial. We aim to mitigate the bias against non-native-accented Flemish in a Flemish ASR system. Since this is a low-resource problem, we investigate the optimal type of data augmentation, i.e., speed/pitch perturbation, cross-lingual voice conversion-based methods, and SpecAugment, applied to both native Flemish and non-native-accented Flemish, for bias mitigation. The results showed that specific types of data augmentation applied to both native and non-native-accented speech improve non-native-accented ASR while applying data augmentation to the non-native-accented speech is more conducive to bias reduction. Combining both gave the largest bias reduction for human-machine interaction (HMI) as well as read-type speech. ...

Using cross-model learnings for the Gram Vaani ASR Challenge 2022

Journal article (2022) - Tanvina Patel, Odette Scharenborg

In the diverse and multilingual land of India, Hindi is spoken as a first language by a majority of its population. Efforts are made to obtain data in terms of audio, transcriptions, dictionary, etc. to develop speech-technology applications in Hindi. Similarly, the Gram-Vaani ASR Challenge 2022 provides spontaneous telephone speech, with natural background and regional variations in Hindi. The challenge provides 100 hours of labeled train-set, 5 hours of labeled dev-set and 1000 hours of unlabeled data-set. For the 'Closed Challenge', we trained an End-to-End (E2E) Conformer model using speed perturbations and SpecAugment techniques, and used VTLN to handle any unknown speaker groups in the blind evaluation set. On the dev-set, we achieved a 30.3% WER compared to the 34.8% WER by the Challenge E2E baseline. For the 'Self Supervised Closed Challenge', a semi-supervised learning approach is used. We generated pseudo-transcripts for the unlabeled data using a hybrid TDNN-3gram LM model and trained an E2E model. This is then used as a seed for retraining the E2E model with high confidence data. Cross-model learning and refining of the E2E model gave 25.3% WER on the dev-set compared to ∼33-35% WER by the Challenge baselines that use wav2vec models. ...
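The WER figures quoted in this abstract can be read in two ways, as an absolute or a relative reduction. The sketch below shows a word-level edit-distance WER and both reductions for the 34.8% and 30.3% dev-set figures. The function names `word_error_rate` and `wer_reduction` are hypothetical, and whitespace tokenisation is an assumption; challenge scoring scripts may normalise text differently.

```python
from typing import Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length.

    Hypothetical helper using whitespace tokenisation and the standard
    Levenshtein distance computed over words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    prev = list(range(len(hyp) + 1))   # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline: float, system: float) -> Tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline - system
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Dev-set figures quoted in the abstract: baseline 34.8% WER, system 30.3% WER.
absolute, relative = wer_reduction(34.8, 30.3)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 4.5 points, relative: 12.9%"
```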

Mitigating bias against non-native accents

Journal article (2022) - Yuanyuan Zhang, Yixuan Zhang, Bence Mark Halpern, Tanvina Patel, Odette Scharenborg

Automatic speech recognition (ASR) systems have seen substantial improvements in the past decade; however, not for all speaker groups. Recent research shows that bias exists against different types of speech, including non-native accents, in state-of-the-art (SOTA) ASR systems. To attain inclusive speech recognition, i.e., ASR for everyone irrespective of how one speaks or the accent one has, bias mitigation is necessary. Here we focus on bias mitigation against non-native accents using two different approaches: data augmentation and more effective training methods. We used an autoencoder-based cross-lingual voice conversion (VC) model to increase the amount of non-native accented speech training data in addition to data augmentation through speed perturbation. Moreover, we investigate two training methods, i.e., fine-tuning and domain adversarial training (DAT), to see whether they can use the limited non-native accented speech data more effectively than a standard training approach. Experimental results show that VC-based data augmentation successfully mitigates the bias against non-native accents for the SOTA end-to-end (E2E) Dutch ASR system. Combining VC and speed perturbed data gave the lowest word error rate (WER) and the smallest bias against non-native accents. Fine-tuning and DAT reduced the bias against non-native accents but at the cost of native performance. ...

Comparing data augmentation and training techniques to reduce bias against non-native accents in hybrid speech recognition systems

Conference paper (2022) - Yixuan Zhang, Y. Zhang, T.B. Patel, O.E. Scharenborg

One important problem that needs tackling for wide deployment of Automatic Speech Recognition (ASR) is the bias in ASR, i.e., ASRs tend to generate more accurate predictions for certain speaker groups while making more errors on speech from other groups. We aim to reduce bias against non-native speakers of Dutch compared to native Dutch speakers. We investigate three different data augmentation techniques (speed perturbation, volume perturbation, and pitch shift) to increase the amount of non-native accented Dutch training data, and use the augmented data for two transfer learning techniques, model fine-tuning and multi-task learning, to reduce bias in a state-of-the-art hybrid HMM-DNN Kaldi-based ASR system. Experimental results on Dutch read speech and human-machine interaction (HMI) speech showed that although individual data augmentation techniques did not always yield an improved recognition performance, the combination of all three did. Importantly, bias was reduced by more than 18% absolute compared to the baseline system for read speech when applying pitch shift and multi-task training, and by more than 7% for HMI speech when applying all three data augmentation techniques during fine-tuning, while improving recognition accuracy of both native and non-native Dutch speech. ...