Low-resource automatic speech recognition and error analyses of oral cancer speech

Journal Article (2022)

Author(s)

Bence Mark Halpern (Nederlands Kanker Instituut - Antoni van Leeuwenhoek ziekenhuis, TU Delft - Electrical Engineering, Mathematics and Computer Science, Universiteit van Amsterdam)

Siyuan Feng (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Rob van Son (Universiteit van Amsterdam, Nederlands Kanker Instituut - Antoni van Leeuwenhoek ziekenhuis)

Michiel van den Brekel (Universiteit van Amsterdam, Nederlands Kanker Instituut - Antoni van Leeuwenhoek ziekenhuis)

Odette Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Multimedia Computing

Keywords

Automatic speech recognition, Low-resource, Oral cancer, Pathological speech, Phoneme analysis

DOI related publication

https://doi.org/10.1016/j.specom.2022.04.006 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:9114f9ce-69e9-4d8e-b432-5132fe3c4516

Publication Year

2022

Language

English

Journal title

Speech Communication

Volume number

141

Pages (from-to)

14-27

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. A two-hour English oral cancer speech dataset was collected from YouTube. Formulating the problem as a low-resource oral cancer ASR task, we investigate three acoustic modelling approaches that have previously worked well in low-resource scenarios, using two different architectures, a hybrid architecture and a transformer-based end-to-end (E2E) model: (1) a retraining approach; (2) a speaker adaptation approach; and (3) a disentangled representation learning approach (only using the hybrid architecture). Compared to a baseline system that is not trained on oral cancer speech, the approaches achieve absolute word error rate reductions of (1) 4.7% (hybrid) and 7.5% (E2E); (2) 7.7%; and (3) 2.0%, respectively. A detailed analysis of the speech recognition results shows that: (1) plosives and certain vowels are the most difficult sounds to recognise in oral cancer speech, a problem that is successfully alleviated by our proposed approaches; (2) however, these sounds are also relatively poorly recognised in healthy speech, with the exception of /p/; (3) recognition performance of certain phonemes is strongly data-dependent; (4) in terms of the manner of articulation, E2E performs better with the exception of vowels, although vowels contribute substantially to overall performance; as for the place of articulation, vowels, labiodentals, dentals and glottals are better captured by hybrid models, whereas E2E is better on bilabial, alveolar, postalveolar, palatal and velar information; and (5) our analysis provides some guidelines for selecting words that can be used as voice commands for ASR systems for oral cancer speakers.
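The abstract reports its results as absolute word error rate (WER) reductions, that is, differences in percentage points between a baseline WER and the WER of an adapted system. The sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and an absolute reduction. It is an illustration only: the function names and the example transcripts are hypothetical and are not taken from the paper, its corpus, or its evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Absolute reduction: a plain difference, not a ratio to the baseline."""
    return baseline_wer - adapted_wer


# Hypothetical transcripts, for illustration only.
reference = "turn on the light"
baseline_wer = word_error_rate(reference, "turn of light")     # 2 errors / 4 words
adapted_wer = word_error_rate(reference, "turn on the night")  # 1 error / 4 words
print(absolute_wer_reduction(baseline_wer, adapted_wer))
```

An absolute reduction is not the same as a relative one: the same absolute gain corresponds to a larger relative improvement when the baseline WER is low than when it is high.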
