Evaluating an ASR Pipeline for a Social Robot


Abstract

The use of social robots such as Pepper, which rely on verbal communication as their main means of interacting with humans, has increased markedly. Verbal communication with a robot relies on Automatic Speech Recognition (ASR) to recognize words from an audio stream containing speech.
These social robots are increasingly used in noisy environments.
This thesis therefore investigates 1) whether Pepper's built-in keyword spotter can be replaced by an ASR system able to recognize continuous speech in Dutch; and 2) whether Pepper's ASR pipeline can be made more robust against noise without changing its hardware.
To that end, Pepper's built-in keyword spotter and Sound Source Localization (SSL) algorithm are evaluated against an ASR pipeline based on a Delay-and-Sum beamformer, MUSIC Sound Source Localization, and Google Cloud Speech-to-Text.
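
For readers unfamiliar with the beamforming step, the following is a minimal sketch of the Delay-and-Sum idea, not the implementation used in the thesis. It assumes the per-microphone delays are already known as whole numbers of samples (in a pipeline like the one described, they would be derived from the estimated source direction and the array geometry); the function name `delay_and_sum` and its parameters are hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]],
    delays: Sequence[int],
) -> List[float]:
    """Align each channel by its (assumed known) integer delay and average.

    channels: one sample sequence per microphone.
    delays:   non-negative arrival delay, in samples, of the target
              source at each microphone (hypothetical input).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if not channels:
        raise ValueError("need at least one channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")

    # Only keep the samples for which every shifted channel has data.
    n_out = min(len(ch) - d for ch, d in zip(channels, delays))
    if n_out <= 0:
        return []

    n_mics = len(channels)
    # Advancing channel m by delays[m] lines up the target signal across
    # microphones; averaging then reinforces it relative to signals that
    # arrive with other delay patterns.
    return [
        sum(ch[d + t] for ch, d in zip(channels, delays)) / n_mics
        for t in range(n_out)
    ]


# Toy usage: the same signal reaches the second microphone one sample later.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
mic0 = source
mic1 = [0.0] + source[:-1]
aligned = delay_and_sum([mic0, mic1], delays=[0, 1])
```

Integer delays keep the sketch short; a practical beamformer would typically need fractional delays, which can be applied in the frequency domain.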

The proposed pipeline showed a significant decrease in Keyword Error Rate of 6.2% compared to Pepper's built-in keyword spotter, and a significant decrease in Word Error Rate (WER) of 21.4% on Dutch continuous speech in clean listening conditions. At a Signal-to-Noise Ratio (SNR) of 8 dB, a decrease in WER of 13.3% was observed, and a decrease in WER persisted at lower SNRs.
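
As a reminder of what the WER figures measure, the sketch below computes WER under its standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It is an illustration only. The abstract does not state how the thesis normalizes text before scoring, so the whitespace tokenization here is an assumption, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy usage: one substituted word and one deleted word out of four.
wer = word_error_rate("de robot luistert goed", "de robot luister")
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.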

These results show that Pepper's speech recognition can be improved and made more robust against noise by preprocessing the audio with MUSIC SSL and a Delay-and-Sum beamformer and transcribing the Dutch speech with Google Cloud Speech-to-Text.
