Whispering, characterized by its soft, breathy, and hushed qualities, is a distinct form of speech commonly used for private communication; it also occurs in cases of pathological speech. The acoustic characteristics of whispered speech differ substantially from those of normally phonated speech, and this mismatch, together with the scarcity of adequate training data, leads to low automatic speech recognition (ASR) performance. This project aims to build an ASR system that can recognize both normal and whispered speech and to discover which acoustic characteristics of whispered speech affect whispered speech recognition.
In this study, I use signal processing techniques that transform the spectral characteristics of normal speech into those of pseudo-whispered speech, an approach referred to as pseudo-whispered-based data augmentation. I enhance an end-to-end ASR system by combining pseudo-whispered speech with two state-of-the-art (SOTA) data augmentation methods, speed perturbation and SpecAugment, yielding an 18.2\% relative reduction in word error rate compared to the strongest baseline.
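To make the reported metric concrete, the following is a minimal sketch of how word error rate and a relative reduction between two systems are commonly computed. It is not the evaluation code used in this study; the function names `word_error_rate` and `relative_reduction` are hypothetical, and the sketch assumes whitespace-tokenized transcripts with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: transcripts are split on whitespace only (no normalization).
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration with made-up values that are not results from this study, `relative_reduction(0.50, 0.40)` evaluates to approximately 0.2, that is, a 20\% relative reduction even though the absolute difference is only 10 percentage points.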
Among the accented speaker groups in the wTIMIT database, recognition performance is best for US English. Further investigation shows that the lack of pitch in whispered speech has the largest impact on the performance of whispered speech ASR.