Dysarthric Speech Recognition Fusing Large Pre-Trained Model Extracted Acoustic Features With Articulatory Data
X. Xu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Zhengjun Yue – Mentor (TU Delft - Multimedia Computing)
O.E. Scharenborg – Mentor (TU Delft - Multimedia Computing)
Abstract
Dysarthric speech recognition is challenging due to the high speech variability caused by neurological disorders. This study explores integrating articulatory features with acoustic features extracted by large pre-trained models (e.g., WavLM, Whisper) to improve recognition performance. Different fusion strategies, including concatenation and cross-attention mechanisms, are also compared. Experimental results show that articulatory features can enhance WavLM-extracted features, reducing the word error rate (WER) for speakers with moderate and mild severity levels. A t-SNE analysis reveals how articulatory features influence the learned feature representations. These findings highlight the potential of multimodal fusion for improving dysarthric ASR systems.
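To make the two fusion strategies named above concrete, the following is a minimal PyTorch sketch, not the thesis's actual implementation: the FusionModule class, the feature dimensions, and the assumption that the articulatory stream is frame-aligned with the acoustic stream are all illustrative placeholders.

```python
import torch
import torch.nn as nn


class FusionModule(nn.Module):
    """Illustrative fusion of acoustic features (e.g., WavLM frame outputs)
    with articulatory features. mode="concat" joins the two streams along
    the feature dimension; mode="cross_attn" lets acoustic frames attend to
    the articulatory frames via multi-head cross-attention."""

    def __init__(self, acoustic_dim=768, artic_dim=64, n_heads=8, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "cross_attn":
            # Project the articulatory stream to the acoustic dimension so it
            # can serve as keys/values for the acoustic queries.
            self.artic_proj = nn.Linear(artic_dim, acoustic_dim)
            self.cross_attn = nn.MultiheadAttention(
                embed_dim=acoustic_dim, num_heads=n_heads, batch_first=True
            )
            self.out_dim = acoustic_dim
        else:
            self.out_dim = acoustic_dim + artic_dim

    def forward(self, acoustic, artic):
        # acoustic: (batch, frames, acoustic_dim)
        # artic:    (batch, frames, artic_dim), assumed time-aligned
        if self.mode == "cross_attn":
            kv = self.artic_proj(artic)
            fused, _ = self.cross_attn(query=acoustic, key=kv, value=kv)
            return acoustic + fused  # residual keeps the acoustic information
        return torch.cat([acoustic, artic], dim=-1)


if __name__ == "__main__":
    acoustic = torch.randn(2, 100, 768)  # e.g., WavLM frame-level features
    artic = torch.randn(2, 100, 64)      # e.g., EMA-derived articulatory features
    for mode in ("concat", "cross_attn"):
        fused = FusionModule(mode=mode)(acoustic, artic)
        print(mode, fused.shape)  # (2, 100, 832) vs. (2, 100, 768)
```

Note the design trade-off this sketch exposes: concatenation preserves both streams unchanged but grows the feature dimension, while cross-attention keeps the acoustic dimension fixed and lets the model learn which articulatory frames are relevant to each acoustic frame.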