Z. Yue | TU Delft Repository

Dysarthric Speech Recognition Fusing Large Pre-Trained Model Extracted Acoustic Features With Articulatory Data

Master thesis (2025) - X. Xu, Z. Yue, O.E. Scharenborg

Dysarthric speech recognition is challenging due to speech variability caused by neurological disorders. This study explores integrating articulatory features with large pre-trained acoustic model features (e.g., WavLM, Whisper) to improve recognition performance. Different fusion strategies, including concatenation and cross-attention mechanisms, are also compared in this work. Experimental results show that articulatory features can enhance WavLM-extracted features, reducing WER for the moderate and mild severity levels. t-SNE analysis reveals how articulatory features influence feature representations. These findings highlight the potential of multimodal fusion in improving dysarthric ASR systems. ...

Improving the Performance of Automatic Speech Recognition for Children with Developmental Language Disorders

Master thesis (2025) - X. Wan, O.E. Scharenborg, J. Sun, T.J. Viering, Z. Yue

Automatic Speech Recognition (ASR) systems perform well for typical adult speech but remain challenged by children’s speech, especially that of children with Developmental Language Disorder (DLD). This study investigates how ASR performance can be enhanced for DLD speech while maintaining accuracy on typical child speech. Two state-of-the-art ASR models, a conformer-based model and Whisper-large-v3, were evaluated using Dutch typical (Jasmin) and atypical (Auris) child speech. The experiments examine data augmentation methods, including speed perturbation and vocal tract length perturbation, and transfer learning through fine-tuning. Results show that both techniques improve DLD speech recognition without degrading typical speech accuracy. The best performance was achieved by combining augmentation and fine-tuning with domain-matched DLD data, reaching 53.2% WER on the Auris test set, while mismatched fine-tuning reduced gains, particularly for Whisper. Overall, the findings demonstrate that integrating data augmentation and fine-tuning offers an effective, balanced approach toward inclusive and robust ASR for children with DLD. ...
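As an illustration of the speed perturbation mentioned above, the following is a minimal sketch that resamples a waveform by linear interpolation. It is not the thesis's implementation: the function name `speed_perturb`, the use of linear interpolation, and the example factors are assumptions made for this illustration.

```python
def speed_perturb(samples, factor):
    """Resample a waveform so it plays `factor` times faster.

    Hypothetical helper: `samples` is a list of floats, `factor` > 1
    shortens the signal and `factor` < 1 lengthens it. Linear
    interpolation is an assumption; toolkits often use higher-quality
    resampling filters.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = int(len(samples) / factor)
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional index into the input
        lo = min(int(pos), last)      # left neighbour
        hi = min(lo + 1, last)        # right neighbour, clamped at the end
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# Toy usage: a 4-sample ramp played twice as fast and half as fast.
ramp = [0.0, 1.0, 2.0, 3.0]
faster = speed_perturb(ramp, 2.0)
slower = speed_perturb(ramp, 0.5)
```

Because the signal is simply resampled, pitch and duration change together, which is what lets the perturbed copies act as additional pseudo-speakers during training.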

How Does OpenAI’s Whisper Interpret Dysarthric Speech?

An Analysis of Acoustic Feature Probing and Representation Layers for Dysarthric Speech

Bachelor thesis (2024) - O. Agaoglu, Z. Yue, Y. Zhang

This paper investigates how OpenAI’s Whisper model processes dysarthric speech by probing its internal acoustic feature representations. Utilizing the TORGO database, we analyzed Whisper’s capability to encode significant acoustic features specific to dysarthric speech across its encoding layers. Our findings reveal that initial layers are particularly effective in capturing distinct features, while deeper layers show generalized representations. Despite this, Whisper’s zero-shot performance in distinguishing dysarthric speech severity levels is noteworthy. We employed a series of probing tasks tailored to dysarthric speech characteristics to pinpoint specific features and their transformation across the model’s layers. This study highlights Whisper’s potential in handling atypical speech patterns without fine-tuning, paving the way for further research into the interpretability and application of transformer-based models in medical and assistive technologies. We discuss the implications of these results for enhancing transparency, reliability, and safe AI integration in healthcare. ...

Automatic Dysarthria Severity Assessment using Whisper-extracted Features

Evaluating ML architectures for dysarthria severity assessment on TORGO and MSDM

Bachelor thesis (2024) - C. Charlesworth, Z. Yue, Y. Zhang, T. Durieux

Dysarthria is a speech disorder commonly caused by neurological disorders such as strokes, cerebral palsy and Amyotrophic Lateral Sclerosis (ALS). The severity level of dysarthria greatly influences the appropriate treatment for a patient. However, assessing the severity of dysarthria in a patient is a time-consuming process that requires a trained speech therapist. Therefore, this work explores a variety of classifier architectures for automatic dysarthria severity assessment using Whisper encodings. The datasets used were MSDM and TORGO, while the classifier architectures implemented included a Convolutional Neural Network and Recurrent Neural Network variants. Across both datasets, the Gated Recurrent Unit (GRU) network achieved the best performance, with 97.21% accuracy on MSDM and 97.47% on TORGO. ...

Reducing Bias in State-of-the-Art ASR Systems for Child Speech

Addressing Age and Gender Disparities through Transfer Learning Strategies

Bachelor thesis (2024) - F.A. Zeisler, Y. Zhang, Z. Yue, T. Durieux

Automatic Speech Recognition (ASR) systems have transformed human-machine interaction, yet they often struggle with child speech due to its unique vocal characteristics. This thesis investigates age and gender biases, focusing on enhancing the performance of the state-of-the-art ASR model Whisper on child speech. Initial experiments reveal significant disparities in recognition accuracy across age groups and genders within child speech, highlighting the critical need for targeted improvements. The study uses Low-Rank Adaptation (LoRA) to finetune the model using four child-specific datasets, aiming to simultaneously enhance recognition performance and mitigate biases. Results demonstrate substantial reductions in Word Error Rates (WER) and biases after finetuning, showcasing the effectiveness of transfer learning in addressing demographic inequality. Gender biases decreased by 32.77% relative to their initial values, and age biases also improved, with a relative decrease of 27.52% after finetuning. This research showcases the potential of tailored approaches to advance ASR technology for low-resource user demographics, with implications for improving educational and assistive technologies.

Index Terms: Automatic Speech Recognition, Child speech, Whisper ASR model, Age and gender biases, Low-Rank Adaptation, Transfer learning, Demographic disparities ...

Improving State-of-the-Art ASR Systems for Speakers with Dysarthria

Applying Low-Rank Adaptation Transfer Learning to Whisper

Bachelor thesis (2024) - M. Günther, Z. Yue, Y. Zhang, T. Durieux

Dysarthria is a speech disorder that limits an individual’s ability to clearly articulate, due to the weakening of the muscles involved in speech. Despite recent advances in Automatic Speech Recognition (ASR), the recognition of dysarthric speech remains a significant challenge because of the limited availability of dysarthric speech data, significant speaker variability, and the mismatch between typical and dysarthric speech patterns. This paper addresses these challenges by using transfer learning and Low-Rank Adaptation (LoRA) techniques to enhance the performance of the state-of-the-art ASR model Whisper on dysarthric speech. By fine-tuning Whisper with the TORGO dataset, this study aims to adapt the pre-trained models to better recognise dysarthric speech patterns, thus reducing Word Error Rates (WER) and improving accessibility for individuals with speech impairments. Experimental results indicate that this approach can improve speech recognition performance since the Large-V2, Large-V3 and the corresponding distilled models achieved a reduction in WER after fine-tuning. The Large-V3 model achieved the greatest relative WER reduction of 22.65%. ...
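For readers unfamiliar with the term, a relative WER reduction compares the drop in WER to the starting WER rather than reporting the absolute difference. The sketch below shows the arithmetic; the function name `relative_reduction` and the example values are hypothetical and are not results from the thesis.

```python
def relative_reduction(before, after):
    """Relative reduction in percent: (before - after) / before * 100.

    Hypothetical helper; `before` and `after` are error rates in the
    same unit (for example, WER in percent).
    """
    if before <= 0:
        raise ValueError("baseline error rate must be positive")
    return (before - after) * 100.0 / before


# Hypothetical values, not taken from the thesis: a WER that falls from
# 50.0% to 40.0% is a 10-point absolute drop but a 20% relative reduction.
example = relative_reduction(50.0, 40.0)
```

Reporting the relative figure makes models with very different baseline WERs easier to compare, since the same absolute drop matters more when the baseline is already low.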

Evaluating Alternative Metrics for Dysarthric Speech Recognition

Assessing the Effectiveness of Different Evaluation Metrics in Dysarthric Speech Recognition Systems Across Various Severities

Bachelor thesis (2024) - H.C. Nguyen Duc, Z. Yue, Y. Zhang, T. Durieux

Dysarthria is a motor speech disorder resulting in slurred or slow speech that can be difficult to understand. This research paper evaluates the effectiveness of various metrics for automatic speech recognition (ASR), such as character error rate (CER), Jaro-Winkler distance, and BERTscore, in assessing performance specifically for dysarthric speech, which is often inadequately measured by the commonly used word error rate (WER). Using the TORGO database, which includes a range of dysarthria severities, we analyze the performance of chosen evaluation metrics with the Whisper and wav2vec 2.0 ASR systems to understand how they reflect the true speech recognition challenges presented by such atypical speech patterns. Our findings reveal that Whisper generally outperforms wav2vec 2.0, particularly in sentence utterances, by effectively managing complex speech patterns and maintaining semantic integrity. The analysis highlights that single-word utterances strongly correlate with dysarthria severity, while sentence utterances show a lesser correlation due to the mitigating effect of additional linguistic context. ...
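To make the contrast between WER and CER concrete, here is a minimal sketch that computes both from a Levenshtein edit distance. It is not the evaluation code used in the thesis: the function names are hypothetical, no text normalization is applied, and counting spaces as characters in CER is an assumption (conventions differ between toolkits).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One wrong letter costs a whole word under WER but one character under CER.
word_rate = wer("the cat sat", "the cat sad")   # 1 of 3 words wrong
char_rate = cer("the cat sat", "the cat sad")   # 1 of 11 characters wrong
```

In this toy pair a near-miss such as "sad" for "sat" is penalized as heavily as a completely wrong word under WER, while CER registers it as a small error, which is one reason character-level metrics can be more informative for single-word dysarthric utterances.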

A Watermark Recognition System: An Approach to Matching Similar Watermarks

Student report (2023) - D. Banta, S. Kho, A.N. Lantink, A. Marin, V. Petkov, M. Skrodzki, Z. Yue

Watermarks are historical motifs present in the texture of paper that are commonly used to identify the paper manufacturers. They only become visible when viewed under certain light conditions. Under ideal circumstances, researchers may use watermarks to determine a historical document’s origins and context. To identify a watermark, it is matched to a previously archived watermark. Currently, this matching must be done manually, which is neither scalable nor parallelizable. Existing studies explore digital reconstructions of watermarks, but do not focus on a comparison-based setup. This report discusses a system that can automatically identify similar watermarks using traditional image processing techniques. The resulting system speeds up the process considerably, can be used on small datasets, and is more accessible to end-users.

The system uses harmonization, feature extraction, and similarity matching. Harmonization involves improving the clarity of the watermark, which is often obscured by the material properties of the paper. Feature extraction involves finding useful information from the isolated watermarks, and similarity matching uses this information to score the similarity of a pair.

We evaluated our system based on a dataset provided by the German Museum of Books and Writing. Over a broader range of quality, accuracy was found to be within the range of 41-53%. It was also found that improving watermark quality within the dataset improved accuracy results to around 82%. The system shows promise particularly with higher quality datasets. This report therefore demonstrates that traditional image processing techniques can be valuable when applied to situations where artificial intelligence may not be possible or efficient. Further research into this domain would be required to understand the advantages and limitations of image processing in comparison with artificial intelligence.
...