Improving child speech recognition with augmented child-like speech

Conference Paper (2024)

Author(s)

Yuanyuan Zhang (TU Delft - Multimedia Computing)

Zhengjun Yue (TU Delft - Multimedia Computing)

T.B. Patel (TU Delft - Multimedia Computing)

Odette Scharenborg (TU Delft - Multimedia Computing)

DOI related publication

https://doi.org/10.21437/Interspeech.2024-485

Keywords

Data augmentation, Child speech recognition, Child-to-child voice conversion, Cross-lingual voice conversion

To reference this document use:

https://resolver.tudelft.nl/uuid:8dd52696-d920-41e9-a58f-6a791f21c2a3

Publication Year

2024

Language

English

Volume number

2024

Pages (from-to)

5183-5187

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

State-of-the-art automatic speech recognition (ASR) systems perform suboptimally on child speech, and the scarcity of child speech data limits the development of child speech recognition (CSR). We therefore studied child-to-child voice conversion (VC) from existing child speakers in the dataset via monolingual VC, and from additional (new) child speakers via cross-lingual (Dutch-to-German) VC. The results showed that cross-lingual child-to-child VC significantly improved child ASR performance. Experiments on how the quantity of child-to-child cross-lingual VC-generated data affects fine-tuned (FT) ASR models gave the best results with two-fold augmentation for our FT-Conformer and FT-Whisper models, which reduced word error rates (WERs) by ~3% absolute compared to the baseline, and with six-fold augmentation for the model trained from scratch, which improved by 3.6% absolute WER. Moreover, using a small amount of "high-quality" VC-generated data achieved results similar to those of our best FT models.
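
The abstract reports its gains as absolute WER reductions. As a minimal sketch of what that metric measures, the Python below computes WER as word-level edit distance divided by the number of reference words, and shows the difference between an absolute and a relative reduction. The function names and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring setup (text normalisation, toolkit) is not specified here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER, in the same units as the inputs."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

With hypothetical WERs of 25% and 22%, the absolute reduction is 3 percentage points while the relative reduction is 12%, so the two ways of reporting should not be confused.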

Files

Zhang24d_interspeech.pdf

(pdf | 0.32 MB)

License info not available