Improving Northern Regional Dutch Speech Recognition by Adapting Perturbation-based Data Augmentation

More Info
expand_more

Abstract

Automatic speech recognition (ASR) does not perform equally well on every speaker. There is bias against many attributes, including accent. To train Dutch ASR, there exists CGN(Corpus Gesproken Nederlands) and as an extension, the JASMIN corpus with annotated accented data. This paper focuses on improving ASR performance for NRAD (Northern regional accented Dutch) speech, training on speakers from the region of Overijssel. To achieve this improvement, the corpus data is augmented using Vocal Tract Length Perturbation (VTLP), which entails randomly warping the frequency of each recording using a factor in the range [0.9, 1.1]. The baseline and augmented ASR systems are trained using trigram GMM-HMM (Gaussian mixture model hidden Markov models) through the Kaldi toolkit on the DelftBlue supercomputer. This leads to improvements on word error rates (WER) for all speaker groups and styles, with an overall relative improvement of 14,64% and the biggest improvement observed for male speakers - from 25.15% WER to 19,68% WER. The impact of this augmentation on other accents and non-accented speech is not explored. This experiment can serve as a stepping stone for developing overall more robust and less biased Dutch ASR.