Improving State-of-the-Art ASR Systems for Speakers with Dysarthria
Applying Low-Rank Adaptation Transfer Learning to Whisper
M. Günther (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Yue – Mentor (TU Delft - Multimedia Computing)
Y. Zhang – Mentor (TU Delft - Multimedia Computing)
Thomas Durieux – Graduation committee member (TU Delft - Software Engineering)
Abstract
Dysarthria is a speech disorder that limits an individual's ability to articulate clearly, owing to weakening of the muscles involved in speech. Despite recent advances in Automatic Speech Recognition (ASR), recognising dysarthric speech remains a significant challenge because of the limited availability of dysarthric speech data, high speaker variability, and the mismatch between typical and dysarthric speech patterns. This paper addresses these challenges by applying transfer learning and Low-Rank Adaptation (LoRA) to improve the performance of the state-of-the-art ASR model Whisper on dysarthric speech. By fine-tuning Whisper on the TORGO dataset, this study adapts the pre-trained models to better recognise dysarthric speech patterns, thereby reducing Word Error Rates (WER) and improving accessibility for individuals with speech impairments. Experimental results indicate that this approach can improve recognition performance: the Large-V2, Large-V3, and corresponding distilled models all achieved lower WER after fine-tuning, with Large-V3 achieving the greatest relative WER reduction of 22.65%.
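
The LoRA-based adaptation summarised above can be illustrated with a short sketch. The snippet below is a minimal, hedged example assuming the Hugging Face transformers and peft libraries; the checkpoint name, LoRA rank, target modules, and training hyperparameters are illustrative assumptions, not necessarily the values used in this study, and the TORGO preprocessing is omitted.

    # Minimal sketch: LoRA fine-tuning of Whisper for dysarthric speech (illustrative).
    # Assumes Hugging Face `transformers` and `peft`; hyperparameters are placeholders.
    from transformers import (
        WhisperForConditionalGeneration,
        WhisperProcessor,
        Seq2SeqTrainingArguments,
    )
    from peft import LoraConfig, get_peft_model

    checkpoint = "openai/whisper-large-v3"  # one of the base models adapted in the paper
    processor = WhisperProcessor.from_pretrained(checkpoint, language="en", task="transcribe")
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

    # Inject low-rank adapters into the attention projections; the pre-trained
    # weights stay frozen and only the small adapter matrices are trained.
    lora_config = LoraConfig(
        r=32,                                 # adapter rank (assumed value)
        lora_alpha=64,
        target_modules=["q_proj", "v_proj"],  # Whisper attention projection layers
        lora_dropout=0.05,
        bias="none",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()        # typically a small fraction of all parameters

    # Training would proceed with a standard seq2seq setup on preprocessed TORGO
    # utterances (log-Mel input features plus tokenised transcripts), e.g. via
    # Seq2SeqTrainer; dataset preparation and the data collator are omitted here.
    training_args = Seq2SeqTrainingArguments(
        output_dir="whisper-large-v3-torgo-lora",  # hypothetical output path
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=5,
        fp16=True,
    )

Because only the low-rank adapter matrices are updated, the number of trainable parameters stays small, which is one reason this kind of parameter-efficient fine-tuning is attractive when, as here, the available dysarthric speech data is limited.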