Evaluating the Effect of SpecSwap for Purposes of Improving WER Performance of the Western Dutch Region Using the JASMIN-CGN Dataset

Abstract

Bias is a prevalent problem in many modern Automatic Speech Recognition (ASR) systems. An ASR system is biased when it performs worse for a subset of its speakers than for the rest, instead of generalizing equally well to everyone. This can be measured by comparing Word Error Rates (WER) across speaker groups. The type of bias differs from one ASR system to another. Data augmentation techniques for recorded speech have been proposed and shown to reduce WER, and consequently bias. These techniques perturb the audio in a specific way; the perturbed audio is then added to the model's training set and the model is retrained on the enlarged set. One such technique is SpecSwap. This paper explores how SpecSwap affects the WER performance of a hybrid-model ASR system on the West-Dutch region of the JASMIN-CGN dataset. For comparison, Vocal Tract Length Perturbation (VTLP), a state-of-the-art data augmentation technique that has been shown to be effective in other cases, was also used. Both techniques led to a consistent increase in WER. It was therefore concluded that too little data was available for the region for the augmentation policy to be effective, either in any of the subcategories or in the overall performance of the system. However, SpecSwap shows potential for mitigating the widely discussed gender bias in ASR systems, as it reduced the difference between the WER of male and female speakers.
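
To make the augmentation step concrete, the following is a minimal sketch of a SpecSwap-style perturbation, assuming the input is a spectrogram stored as a NumPy array of shape (frequency bins, time frames). It swaps two non-overlapping blocks of frequency channels and then two non-overlapping blocks of time frames. The function names `swap_blocks` and `spec_swap`, the maximum block widths, and the seed are illustrative assumptions, not the configuration used in the experiments.

```python
import numpy as np


def swap_blocks(spec, axis, max_width, rng):
    """Swap two non-overlapping, equally wide blocks along one axis."""
    size = spec.shape[axis]
    width = int(rng.integers(1, max_width + 1))  # block width in 1..max_width
    if 2 * width > size:
        return spec.copy()  # axis too short to hold two separate blocks

    # Pick the first block so that a second block still fits after it.
    first = int(rng.integers(0, size - 2 * width + 1))
    second = int(rng.integers(first + width, size - width + 1))

    # Build an index permutation that exchanges the two blocks.
    idx = np.arange(size)
    idx[first:first + width] = np.arange(second, second + width)
    idx[second:second + width] = np.arange(first, first + width)
    return np.take(spec, idx, axis=axis)


def spec_swap(spec, max_freq_width=10, max_time_width=20, seed=None):
    """Apply one frequency swap and one time swap (hypothetical defaults)."""
    rng = np.random.default_rng(seed)
    out = swap_blocks(spec, axis=0, max_width=max_freq_width, rng=rng)
    out = swap_blocks(out, axis=1, max_width=max_time_width, rng=rng)
    return out


# Toy spectrogram: 80 frequency bins by 100 time frames.
toy_spec = np.arange(80 * 100, dtype=np.float32).reshape(80, 100)
augmented = spec_swap(toy_spec, seed=0)
```

The augmented array keeps the shape and values of the input and only rearranges them, so it can be added to the training set alongside the original utterance and its unchanged transcript.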