Improving whispered speech recognition using pseudo-whispered based data augmentation

Master Thesis (2023)

Authors

Z. Lin (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Supervisors

Odette Scharenborg (Multimedia Computing)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Signal processing, Automatic speech recognition, Whispered speech, Pseudo-whisper

To reference this document use:

https://resolver.tudelft.nl/5f51b210-c2b5-4093-8ec3-7b6ed5bfc5c7

Publication Year

2023

Language

English

Graduation Date

29-08-2023

Awarding Institution

Delft University of Technology

Programme

Electrical Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Whispering, characterized by its soft, breathy, and hushed qualities, is a distinct form of speech that is commonly used for private communication and can also occur in pathological speech. Because the acoustic characteristics of whispered speech differ substantially from those of normally phonated speech and adequate training data is scarce, automatic speech recognition (ASR) performance on whispered speech is low. This project aims to build an ASR system that can recognize both normal and whispered speech, and to identify which acoustic characteristics of whispered speech affect whispered speech recognition.
In my study, I use signal processing techniques to transform the spectral characteristics of normal speech into those of pseudo-whispered speech, an approach called pseudo-whispered-based data augmentation. I enhance an end-to-end ASR system by combining pseudo-whispered speech with two state-of-the-art (SOTA) data augmentation methods, speed perturbation and SpecAugment, which yields an 18.2% relative reduction in word error rate compared to the strongest baseline.
Among the accented speaker groups in the wTIMIT database, US English speakers obtain the best recognition results. Further investigation reveals that the absence of pitch in whispered speech has the largest impact on the performance of whispered speech ASR.
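
For readers unfamiliar with the two standard augmentation methods named in the abstract, the following is a minimal NumPy sketch of what speed perturbation and SpecAugment-style masking do to the input. It is an illustration under stated assumptions, not the thesis implementation: the function names `speed_perturb` and `spec_augment`, the linear-interpolation resampling, the single mask per axis, and all parameter values are hypothetical choices made here. The pseudo-whisper transformation itself is not sketched because the abstract does not specify it.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a 1-D waveform so it plays `factor` times faster.

    Assumption: plain linear interpolation; real toolkits use
    higher-quality resampling filters.
    """
    n_out = int(round(len(wave) / factor))
    # Positions in the original signal from which each output sample is read.
    src = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src, np.arange(len(wave)), wave)


def spec_augment(spec: np.ndarray, max_f: int, max_t: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Zero one random frequency band and one random time span.

    `spec` has shape (freq_bins, frames). Assumption: one mask per
    axis and no time warping, which is a simplification of SpecAugment.
    """
    out = spec.copy()
    n_freq, n_time = out.shape
    f = int(rng.integers(0, max_f + 1))         # frequency mask width
    f0 = int(rng.integers(0, n_freq - f + 1))   # frequency mask start
    out[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))         # time mask width
    t0 = int(rng.integers(0, n_time - t + 1))   # time mask start
    out[:, t0:t0 + t] = 0.0
    return out


# Illustrative usage with synthetic data (not data from the thesis).
wave = np.sin(np.linspace(0.0, 100.0, 16000))
faster = speed_perturb(wave, 1.1)   # shorter signal, played 10% faster
slower = speed_perturb(wave, 0.9)   # longer signal, played 10% slower
masked = spec_augment(np.ones((80, 200)), max_f=10, max_t=20,
                      rng=np.random.default_rng(0))
```

Both functions return new arrays and leave their inputs unchanged, so the same utterance can be augmented several times with different factors or random seeds.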

Files

TU_Delft_thesis_Chaufang.pdf

(pdf | 1.53 MB)

License info not available