Evaluating the Use of Frequency Masking on a Hybrid Automatic Speech Recognizer for Transitional Dutch Accent of JASMIN-CGN Corpus

Abstract

Many experiments have been conducted with Automatic Speech Recognition (ASR) systems, but most focus either on specific speaker categories or on a language in general. As a result, such ASR systems can be biased with respect to gender, age group, or dialect. Analyzing and reducing this bias requires training models on significant amounts of data, which some corpora lack. Augmentation techniques can help here by generating additional, distinct training data without any further collection. This paper explores the use of SpecAugment's frequency masking on one such corpus, JASMIN-CGN, for the Transitional regional accent of Dutch, using a hybrid GMM-HMM architecture, with the aim of reducing gender and age bias for this specific dialect. The experiments show that SpecAugment does not lower the word error rate (WER): the augmented model reaches 20.8% overall, compared to 19.5% for the baseline model. On the contrary, it even increases the bias for age. The results are mainly attributed to the combination of the small amount of data and the hybrid architecture used, which suggests that SpecAugment is a useful augmentation policy only for end-to-end models.
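For readers unfamiliar with the technique, frequency masking can be illustrated with a minimal sketch. It assumes a spectrogram stored as a NumPy array of shape (frequency bins, time frames). The function name `frequency_mask`, its parameters, and the default values are hypothetical and chosen for illustration only; they are not the settings or the implementation used in the experiments described here.

```python
import numpy as np


def frequency_mask(spec, max_width=8, num_masks=1, fill_value=0.0, rng=None):
    """Return a copy of `spec` with random bands of frequency bins masked.

    spec       : 2-D array, shape (num_freq_bins, num_frames)
    max_width  : largest allowed mask width in bins (illustrative default)
    num_masks  : number of masks to apply (illustrative default)
    fill_value : value written into the masked bins (assumed to be 0.0 here)
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = np.array(spec, dtype=float, copy=True)
    num_bins = masked.shape[0]
    # A mask can never be wider than the spectrogram itself.
    max_width = min(max_width, num_bins)

    for _ in range(num_masks):
        # Draw the mask width uniformly from 0..max_width (inclusive).
        width = int(rng.integers(0, max_width + 1))
        if width == 0:
            continue
        # Draw the first masked bin so the band fits inside the spectrogram.
        start = int(rng.integers(0, num_bins - width + 1))
        masked[start:start + width, :] = fill_value

    return masked


# Usage on a synthetic spectrogram with 40 frequency bins and 100 frames.
example = np.random.default_rng(0).random((40, 100))
augmented = frequency_mask(example, max_width=8, num_masks=2,
                           rng=np.random.default_rng(1))
```

Each masked band spans all time frames, so the recognizer sees utterances with whole frequency ranges missing. The input array is left unmodified, which makes it possible to keep both the original and the augmented version of an utterance in the training set.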