Improving End-to-End Models for Children’s Speech Recognition

Journal Article (2024)

Author(s)

T.B. Patel (TU Delft - Electrical Engineering, Mathematics and Computer Science)

O.E. Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Multimedia Computing

Keywords

Children’s speech recognition; Speed perturbations; Spectral augmentation; Vocal tract length normalization; End-to-end automatic speech recognition

DOI related publication

https://doi.org/10.3390/app14062353 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:8babbc9c-1424-42fb-9231-0046e0acc023

Publication Year

2024

Language

English

Issue number

6

Volume number

14

Article number

2353

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of annotated children’s speech data available. We aim to improve CSR in the frequently occurring scenario in which no children’s speech data is available for training the Automatic Speech Recognition (ASR) system. Traditionally, Vocal Tract Length Normalization (VTLN) has been widely used in hybrid ASR systems to address the acoustic mismatch and variability in children’s speech when models are trained on adults’ speech. End-to-End (E2E) systems, in contrast, often use data augmentation methods to create child-like speech from adults’ speech. For ASR systems trained on adults’ speech, we investigate the effectiveness of two augmentation methods, speed perturbations and spectral augmentation, together with VTLN in an E2E framework for the CSR task, and compare them across Dutch, German, and Mandarin. We applied VTLN at different stages of the ASR pipeline (training and/or test) and conducted age and gender analyses. Our experiments showed highly similar patterns across the three languages: speed perturbations and spectral augmentation yielded significant performance improvements, and VTLN provided further improvements while, depending on the stage at which it was applied, maintaining recognition performance on adults’ speech. Additionally, VTLN improved performance for both male and female speakers and was particularly effective for younger children.
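The abstract names three techniques without describing how they work. The sketch below illustrates the general idea behind each one in plain Python; it is not the authors' implementation. All function names, parameters, and default values (for example the cutoff ratio of 0.85 and the direction in which the warping factor alpha acts) are assumptions made for illustration. Toolkits differ in these conventions, and the settings used in the article are not given in this record.

```python
import random


def speed_perturb(samples, factor):
    """Resample a waveform by linear interpolation so that it plays
    `factor` times faster (factor > 1 shortens it, factor < 1 lengthens it)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional position in the input
        lo = int(pos)
        if lo >= len(samples) - 1:    # past the last interval: hold last sample
            out.append(samples[-1])
        else:
            frac = pos - lo
            out.append((1 - frac) * samples[lo] + frac * samples[lo + 1])
    return out


def spec_mask(spec, max_freq_width, max_time_width, rng):
    """Return a copy of a spectrogram (list of frames, each a list of bins)
    with one random frequency band and one random time band set to zero."""
    n_frames, n_bins = len(spec), len(spec[0])
    masked = [row[:] for row in spec]
    f_width = rng.randint(0, min(max_freq_width, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    t_width = rng.randint(0, min(max_time_width, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(n_frames):
        for f in range(n_bins):
            in_freq_band = f_start <= f < f_start + f_width
            in_time_band = t_start <= t < t_start + t_width
            if in_freq_band or in_time_band:
                masked[t][f] = 0.0
    return masked


def vtln_warp(freq, alpha, nyquist, cutoff_ratio=0.85):
    """Piecewise-linear frequency warp: scale by alpha below a cutoff, then
    join linearly so that the Nyquist frequency maps onto itself.
    (Assumed convention; other formulations exist.)"""
    cutoff = cutoff_ratio * nyquist * min(1.0, 1.0 / alpha)
    if freq <= cutoff:
        return alpha * freq
    slope = (nyquist - alpha * cutoff) / (nyquist - cutoff)
    return alpha * cutoff + slope * (freq - cutoff)


if __name__ == "__main__":
    rng = random.Random(0)
    wave = [float(i) for i in range(16)]
    print(len(speed_perturb(wave, 1.1)), len(speed_perturb(wave, 0.9)))
    spec = [[1.0] * 8 for _ in range(10)]
    print(spec_mask(spec, max_freq_width=3, max_time_width=4, rng=rng)[0])
    print(vtln_warp(1000.0, 1.2, 8000.0), vtln_warp(8000.0, 1.2, 8000.0))
```

In this sketch, speed perturbation and masking change the training data, whereas the warp rescales the frequency axis of the features. This is why VTLN can be applied at training time, at test time, or both, as the abstract notes.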

Files

Applsci-14-02353.pdf

(pdf | 0.73 Mb)

License info not available