Using cross-model learnings for the Gram Vaani ASR Challenge 2022

Conference Paper (2022)
Author(s)

T.B. Patel (TU Delft - Multimedia Computing)

Odette Scharenborg (TU Delft - Multimedia Computing)

Research Group
Multimedia Computing
Copyright
© 2022 T.B. Patel, O.E. Scharenborg
DOI
https://doi.org/10.21437/Interspeech.2022-10639
Publication Year
2022
Language
English
Volume number
2022-September
Pages (from-to)
4880-4884
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In the diverse and multilingual land of India, Hindi is spoken as a first language by the majority of the population. Efforts are being made to collect audio, transcriptions, pronunciation dictionaries, and other resources for developing speech-technology applications in Hindi. To this end, the Gram Vaani ASR Challenge 2022 provides spontaneous Hindi telephone speech with natural background noise and regional variation. The challenge provides a 100-hour labeled training set, a 5-hour labeled development set, and 1000 hours of unlabeled data. For the 'Closed Challenge', we trained an end-to-end (E2E) Conformer model using speed perturbation and SpecAugment, and applied vocal tract length normalization (VTLN) to handle unknown speaker groups in the blind evaluation set. On the dev set, we achieved a 30.3% WER, compared to 34.8% for the Challenge E2E baseline. For the 'Self-Supervised Closed Challenge', we used a semi-supervised learning approach: we generated pseudo-transcripts for the unlabeled data using a hybrid TDNN system with a 3-gram language model and trained an E2E model on them. This model then served as a seed for retraining the E2E model on high-confidence pseudo-labeled data. This cross-model learning and refining of the E2E model gave a 25.3% WER on the dev set, compared to ∼33-35% WER for the Challenge baselines that use wav2vec models.
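To make the 'Closed Challenge' recipe concrete, here is a minimal sketch of speed perturbation plus SpecAugment-style masking using torchaudio. The sample rate, perturbation factor, and mask sizes below are assumptions for illustration, not the settings used in the paper.

```python
import torch
import torchaudio

SAMPLE_RATE = 8000  # telephone speech; assumed, the paper may use another rate

def speed_perturb(waveform: torch.Tensor, factor: float) -> torch.Tensor:
    """Speed perturbation via sox 'speed', resampled back to SAMPLE_RATE."""
    perturbed, _ = torchaudio.sox_effects.apply_effects_tensor(
        waveform,
        SAMPLE_RATE,
        [["speed", str(factor)], ["rate", str(SAMPLE_RATE)]],
    )
    return perturbed

# SpecAugment-style masking on log-mel features (mask sizes are assumed).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=80)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

def augment(waveform: torch.Tensor, factor: float = 0.9) -> torch.Tensor:
    """Speed-perturb the waveform, then mask the log-mel spectrogram."""
    feats = mel(speed_perturb(waveform, factor))
    return time_mask(freq_mask((feats + 1e-6).log()))
```

In practice each utterance would typically be duplicated at factors such as 0.9, 1.0, and 1.1, the standard speed-perturbation recipe.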
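The semi-supervised loop can likewise be sketched in a few lines: decode the unlabeled audio with the current seed model, keep only high-confidence hypotheses, and retrain on them. The decoder interface and the 0.9 threshold below are assumptions; the abstract does not spell out the authors' selection criterion.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: an utterance is a path to its audio, and a
# decoder returns a (hypothesis, utterance-level confidence) pair.
Utterance = str
Transcript = str
Decoder = Callable[[Utterance], Tuple[Transcript, float]]

CONF_THRESHOLD = 0.9  # assumed cutoff; not stated in the abstract

def select_high_confidence(
    decode: Decoder, unlabeled: List[Utterance]
) -> List[Tuple[Utterance, Transcript]]:
    """Pseudo-label unlabeled audio, keeping only confident hypotheses."""
    kept = []
    for utt in unlabeled:
        hypothesis, confidence = decode(utt)
        if confidence >= CONF_THRESHOLD:
            kept.append((utt, hypothesis))
    return kept
```

In a first round the decoder would be the hybrid TDNN + 3-gram LM system; in later rounds the retrained E2E Conformer takes its place, which is the cross-model refinement the abstract refers to.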

Files

Patel22_interspeech.pdf
(pdf | 0.521 MB)
- Embargo expired on 01-07-2023
License info not available