Raw acoustic-articulatory multimodal dysarthric speech recognition
Zhengjun Yue (TU Delft - Multimedia Computing)
Erfan Loweimi (The University of Edinburgh)
Zoran Cvetkovic (King’s College London)
Jon Barker (University of Sheffield)
Heidi Christensen (University of Sheffield)
Abstract
Automatic speech recognition (ASR) for dysarthric speech is challenging. The acoustic characteristics of dysarthric speech are highly variable, and there are often fewer distinguishing cues between phonetic tokens. Multimodal ASR utilises data from other modalities to facilitate the task when a single acoustic modality proves insufficient. Articulatory information, which encapsulates knowledge about the speech production process, may constitute such a complementary modality. Although multimodal acoustic-articulatory ASR has received increasing attention recently, incorporating real articulatory data remains under-explored for dysarthric speech recognition. This paper investigates the effectiveness of multimodal acoustic modelling that combines real dysarthric articulatory information with acoustic features, especially raw signal representations, which are more informative than classic features and allow representations tailored to dysarthric ASR to be learned. In particular, various raw acoustic-articulatory multimodal dysarthric speech recognition systems are developed and compared with similar systems built on hand-crafted features. Furthermore, the difference between dysarthric and typical speech in terms of articulatory information is systematically analysed using a statistical space distribution indicator called the Maximum Articulator Motion Range (MAMR). Additionally, we use mutual information analysis to investigate the robustness and phonetic information content of the articulatory features, offering insights that support feature selection and help interpret the ASR results. Experimental results on the widely used TORGO dysarthric speech dataset show that combining the articulatory and raw acoustic features at the empirically determined optimal fusion level achieves a notable performance gain, yielding up to 7.6% and 12.8% relative word error rate (WER) reductions for dysarthric and typical speech, respectively.
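To make the fusion idea concrete, the sketch below shows one common form of intermediate (feature-level) acoustic-articulatory fusion: each modality is encoded separately and the hidden representations are concatenated at a chosen depth before shared layers. This is not the paper's exact architecture; the layer sizes, fusion point, feature dimensionalities, and frame-level targets are all illustrative assumptions.

```python
# Minimal sketch of intermediate acoustic-articulatory fusion (illustrative,
# not the paper's architecture). A raw-waveform front end would replace the
# acoustic branch when learning representations directly from raw signals.
import torch
import torch.nn as nn

class FusionASRModel(nn.Module):
    def __init__(self, acoustic_dim, artic_dim, hidden_dim=256, n_targets=42):
        super().__init__()
        # Modality-specific encoders.
        self.acoustic_enc = nn.Sequential(
            nn.Linear(acoustic_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.artic_enc = nn.Sequential(
            nn.Linear(artic_dim, hidden_dim), nn.ReLU())
        # Shared layers after the fusion point.
        self.post_fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_targets))

    def forward(self, acoustic, articulatory):
        h_a = self.acoustic_enc(acoustic)       # (batch, frames, hidden)
        h_r = self.artic_enc(articulatory)      # (batch, frames, hidden)
        fused = torch.cat([h_a, h_r], dim=-1)   # feature-level fusion
        return self.post_fusion(fused)          # per-frame target scores

model = FusionASRModel(acoustic_dim=40, artic_dim=12)
scores = model(torch.randn(8, 100, 40), torch.randn(8, 100, 12))
print(scores.shape)  # torch.Size([8, 100, 42])
```

Moving the concatenation earlier or later in the stack changes the fusion level, which is what the paper sweeps over empirically.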
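The paper should be consulted for the precise MAMR definition; the sketch below assumes one simple reading of an articulator motion-range indicator, namely the maximum per-channel positional range of the articulatory trajectories over a speaker's utterances. The synthetic data and channel count are placeholders.

```python
# Hypothetical illustration of a motion-range indicator in the spirit of MAMR.
# Assumption: for each articulator channel (e.g., an EMA sensor coordinate),
# take the range of its position trajectory, then keep the maximum over a
# speaker's utterances. This is a sketch, not the paper's exact formula.
import numpy as np

def motion_range(trajectory):
    """trajectory: (frames, channels) array of articulator positions."""
    return trajectory.max(axis=0) - trajectory.min(axis=0)

def max_motion_range(utterances):
    """Maximum per-channel motion range over a list of utterances."""
    return np.max([motion_range(u) for u in utterances], axis=0)

speaker_utts = [np.random.randn(200, 12) for _ in range(5)]  # fake EMA data
print(max_motion_range(speaker_utts))  # one value per articulator channel
```

Comparing such per-speaker values between dysarthric and typical speakers is the kind of space-distribution contrast the abstract describes.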
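The mutual information analysis can be approximated with off-the-shelf estimators. The sketch below uses scikit-learn's mutual_info_classif on synthetic frame-level data; the feature dimensionality, phone inventory size, and the availability of frame-level phone labels (e.g., from forced alignment) are assumptions, not details from the paper.

```python
# Sketch of estimating mutual information between each articulatory feature
# dimension and frame-level phone labels, as a proxy for phonetic information
# content. Data here is synthetic; real use would take aligned frames.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
features = rng.normal(size=(5000, 12))         # frames x articulatory dims
phone_labels = rng.integers(0, 40, size=5000)  # frame-level phone IDs

mi = mutual_info_classif(features, phone_labels, discrete_features=False)
for d, value in enumerate(mi):
    print(f"articulatory dim {d}: MI with phones = {value:.4f} nats")
```

Dimensions with higher MI carry more phone-discriminative information, which is the kind of evidence that can guide feature selection.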
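For readers unfamiliar with the metric, relative WER reduction is computed against the baseline WER. The baseline and system numbers below are illustrative, not taken from the paper.

```python
# Relative WER reduction: e.g., a drop from 30.0% to 27.72% WER is a
# (30.0 - 27.72) / 30.0 = 7.6% relative reduction. Numbers are illustrative.
def relative_wer_reduction(wer_baseline, wer_system):
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

print(relative_wer_reduction(30.0, 27.72))  # 7.6
```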