Improving Northern Regional Dutch Speech Recognition by Adapting Perturbation-based Data Augmentation

Bachelor Thesis (2022)

Author(s)

N.A. Zhlebinkov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

O.E. Scharenborg – Mentor (TU Delft - Multimedia Computing)

T.B. Patel – Mentor (TU Delft - Multimedia Computing)

Joana P. Gonçalves – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Data Augmentation, Speech recognition, Vocal tract length perturbation

To reference this document use:

https://resolver.tudelft.nl/uuid:081e1dc0-6bb3-454c-95cf-b0ac50d7d554

More Info

Publication Year

2022

Language

English

Graduation Date

22-06-2022

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic speech recognition (ASR) does not perform equally well for every speaker: systems are biased with respect to many speaker attributes, including accent. For training Dutch ASR, there is the CGN (Corpus Gesproken Nederlands) and, as an extension, the JASMIN corpus, which contains annotated accented speech. This paper focuses on improving ASR performance for Northern regional accented Dutch (NRAD) speech, training on speakers from the region of Overijssel. To achieve this improvement, the corpus data is augmented using Vocal Tract Length Perturbation (VTLP), which warps the frequency axis of each recording by a factor drawn at random from the range [0.9, 1.1]. The baseline and augmented ASR systems are trigram GMM-HMM (Gaussian mixture model, hidden Markov model) systems trained with the Kaldi toolkit on the DelftBlue supercomputer. Augmentation improves the word error rate (WER) for all speaker groups and styles, with an overall relative improvement of 14.64%. The biggest improvement is observed for male speakers, whose WER drops from 25.15% to 19.68%. The impact of this augmentation on other accents and on non-accented speech is not explored. This experiment can serve as a stepping stone towards more robust and less biased Dutch ASR overall.
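To make the VTLP step concrete, the following is a minimal sketch of one common formulation, the piecewise-linear frequency warp. It is not taken from the thesis, which performs augmentation through Kaldi and whose exact warping function is not given in this abstract. The function names, the 16 kHz sample rate, and the 4800 Hz boundary frequency `f_hi` are assumptions chosen for illustration; only the factor range [0.9, 1.1] comes from the abstract.

```python
import random


def sample_warp_factor(rng, low=0.9, high=1.1):
    """Draw one warp factor per recording from [low, high] (range from the abstract)."""
    return rng.uniform(low, high)


def warp_frequency(freq_hz, alpha, sample_rate=16000, f_hi=4800.0):
    """Piecewise-linear VTLP warp of a single frequency (in Hz).

    Assumed formulation: frequencies below a breakpoint are scaled by alpha;
    above it, a second linear segment maps the rest of the band so that the
    Nyquist frequency stays fixed. sample_rate and f_hi are illustrative values.
    """
    nyquist = sample_rate / 2.0
    boundary = f_hi * min(alpha, 1.0)   # warped value at the breakpoint
    breakpoint_hz = boundary / alpha    # input frequency where the segments meet
    if freq_hz <= breakpoint_hz:
        return freq_hz * alpha
    slope = (nyquist - boundary) / (nyquist - breakpoint_hz)
    return nyquist - slope * (nyquist - freq_hz)


if __name__ == "__main__":
    rng = random.Random(0)              # fixed seed so the run is repeatable
    alpha = sample_warp_factor(rng)
    for f in (500.0, 2000.0, 4000.0, 6000.0, 8000.0):
        print(f"alpha={alpha:.3f}  {f:7.1f} Hz -> {warp_frequency(f, alpha):7.1f} Hz")
```

In a feature-extraction pipeline, a warp of this kind would typically be applied to the centre frequencies of the mel filter bank rather than to the waveform, with one factor drawn per recording.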

Files

RP_Paper_NZhlebinkov_v3.5.pdf

(pdf | 0.136 Mb)

License info not available