Using cross-model learnings for the Gram Vaani ASR Challenge 2022

Journal Article (2022)

Author(s)

T.B. Patel (TU Delft - Multimedia Computing)

Odette Scharenborg (TU Delft - Multimedia Computing)

Multimedia Computing

DOI related publication

https://doi.org/10.21437/Interspeech.2022-10639

Hybrid ASR; Semi-supervised learning; Cross-architecture learning; End-to-end ASR; Gram-Vaani Challenge

To reference this document use:

https://resolver.tudelft.nl/uuid:c3fe23e3-a025-49a6-953c-30349cae003a

Publication Year

2022

Language

English

Copyright

Multimedia Computing

Volume number

2022-September

Pages (from-to)

4880-4884

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In the diverse and multilingual land of India, Hindi is spoken as a first language by a majority of its population. Efforts are being made to obtain data (audio, transcriptions, dictionaries, etc.) to develop speech-technology applications in Hindi. In the same spirit, the Gram-Vaani ASR Challenge 2022 provides spontaneous telephone speech in Hindi, with natural background and regional variations. The challenge provides a 100-hour labeled train-set, a 5-hour labeled dev-set, and a 1000-hour unlabeled data-set. For the 'Closed Challenge', we trained an End-to-End (E2E) Conformer model using speed perturbation and SpecAugment, and used vocal tract length normalization (VTLN) to handle any unknown speaker groups in the blind evaluation set. On the dev-set, we achieved a 30.3% WER compared to the 34.8% WER of the Challenge E2E baseline. For the 'Self Supervised Closed Challenge', we used a semi-supervised learning approach. We generated pseudo-transcripts for the unlabeled data using a hybrid TDNN model with a 3-gram LM and trained an E2E model. This model was then used as a seed for retraining the E2E model with high-confidence data. Cross-model learning and refinement of the E2E model gave a 25.3% WER on the dev-set, compared to the ∼33-35% WER of the Challenge baseline, which uses wav2vec models.
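The high-confidence selection step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how confidence was computed or which cut-off was used, so the `PseudoLabel` record, the utterance-level `confidence` score in [0, 1], the `0.9` threshold, and both function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PseudoLabel:
    """One automatically transcribed utterance (hypothetical record layout)."""
    utt_id: str
    transcript: str
    confidence: float  # assumed utterance-level score in [0, 1]


def select_high_confidence(
    pseudo_labels: List[PseudoLabel], threshold: float = 0.9
) -> List[PseudoLabel]:
    """Keep non-empty pseudo-transcripts whose confidence meets the threshold."""
    return [
        p for p in pseudo_labels
        if p.transcript.strip() and p.confidence >= threshold
    ]


def build_retraining_set(
    labeled: List[Tuple[str, str]],
    pseudo_labels: List[PseudoLabel],
    threshold: float = 0.9,
) -> List[Tuple[str, str]]:
    """Combine human-labeled pairs with the selected pseudo-labeled pairs."""
    kept = select_high_confidence(pseudo_labels, threshold)
    return list(labeled) + [(p.utt_id, p.transcript) for p in kept]


if __name__ == "__main__":
    # Toy data only; the IDs, transcripts, and scores are made up.
    labeled = [("train_0001", "namaste")]
    pseudo = [
        PseudoLabel("unlab_0001", "dhanyavaad", 0.96),
        PseudoLabel("unlab_0002", "", 0.99),        # empty hypothesis, dropped
        PseudoLabel("unlab_0003", "kal milenge", 0.41),  # low confidence, dropped
    ]
    print(build_retraining_set(labeled, pseudo))
```

In a pipeline of the kind the abstract describes, the pseudo-transcripts would come from the hybrid model and the filtered set would be used to retrain the E2E model; the threshold trades off the amount of added data against the noise in its transcripts.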

Files

Patel22_interspeech.pdf

(pdf | 0.521 Mb)

- Embargo expired on 01-07-2023

License info not available