How Good Are State-of-the-Art Automatic Speech Recognition Systems in Recognizing Dutch Diverse Speech?

An Evaluation of Meta MMS and OpenAI Whisper on Native and Non-Native Dutch Speech


Abstract

Automatic speech recognition (ASR) is increasingly used in daily applications, such as voice-activated virtual assistants like Siri and Alexa, real-time transcription for meetings and lectures, and voice commands for smart home devices. However, studies show that even state-of-the-art (SotA) ASR systems do not recognize everyone's speech equally well.

To the best of my knowledge, this paper is the first to evaluate the performance of Meta's SotA ASR system, Massively Multilingual Speech (MMS), on native and non-native Dutch speech. Using the Jasmin Corpus, which includes a diverse set of both native and non-native Dutch speakers, this study assesses performance with metrics such as word error rate (WER), character error rate (CER), and word information lost (WIL). Additionally, the same methodology is applied to the same data using OpenAI's ASR system, Whisper, to provide a comparative analysis.
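As an illustration of how these error metrics can be computed, the following is a minimal sketch using the open-source jiwer package; the reference and hypothesis strings are placeholders, not actual Jasmin Corpus transcripts, and the study itself may use different tooling.

```python
import jiwer

# Placeholder reference (ground truth) and hypothesis (ASR output) transcripts;
# in the actual evaluation these would come from the Jasmin Corpus annotations
# and from MMS/Whisper decoding, respectively.
reference = "de kat zit op de mat"
hypothesis = "de kat zat op mat"

# Word error rate: (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(reference, hypothesis)

# Character error rate: the same edit-distance idea at the character level
cer = jiwer.cer(reference, hypothesis)

# Word information lost: 1 - hits^2 / (reference words * hypothesis words)
wil = jiwer.wil(reference, hypothesis)

print(f"WER: {wer:.3f}  CER: {cer:.3f}  WIL: {wil:.3f}")
```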

The paper analyzes the WER, CER, and WIL error metrics and processing time, and investigates the best-suited beam size for Whisper. It also breaks down the types of errors (deletions, insertions, and substitutions) made by each model across different age groups of Dutch speakers, as sketched below.
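As a rough sketch of this kind of per-utterance analysis, the snippet below transcribes one audio file with the openai-whisper package at a chosen beam size and counts substitutions, deletions, and insertions with jiwer. The model size, beam size, file path, and reference text are illustrative assumptions, not the exact configuration or data used in the study.

```python
import jiwer
import whisper

# Illustrative settings; the study compares several beam sizes.
MODEL_SIZE = "large-v2"                  # assumed Whisper model size
BEAM_SIZE = 5                            # one candidate beam size
AUDIO_PATH = "speaker_001.wav"           # hypothetical Jasmin Corpus fragment
REFERENCE = "plaats hier de referentietranscriptie"  # placeholder ground truth

model = whisper.load_model(MODEL_SIZE)

# Decode Dutch audio with beam search at the chosen beam size.
result = model.transcribe(AUDIO_PATH, language="nl", beam_size=BEAM_SIZE)
hypothesis = result["text"]

# Align reference and hypothesis to count the three error types.
out = jiwer.process_words(REFERENCE, hypothesis)
print(f"substitutions={out.substitutions} "
      f"deletions={out.deletions} insertions={out.insertions}")
print(f"WER={out.wer:.3f}")
```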