The effects on speech detection of low sample frequency audio data

Abstract

Interaction between humans and machines is now common in daily life. Audio of human communication is a rich source of information, but letting machines listen to it is considered privacy-invasive. Reducing the sampling frequency can preserve privacy by making conversation unintelligible while still allowing detection of whether someone is speaking. This paper investigates how much low-sample-frequency audio hinders the detection of speech. Speech is detected with voice activity detection (VAD), a signal processing technique that identifies which short segments of audio contain speech. Three state-of-the-art voice activity detectors of two types were used in the experiment: one supervised method (pyannote) and two unsupervised methods (rVAD in pitch mode and in flatness mode). The unsupervised methods outperformed the supervised model, and rVAD pitch mode performed best of the three. More specifically, the performance of the unsupervised VADs dropped as the sample rate decreased, while the supervised VAD did not work well at higher sample frequencies. At sample rates of 8000 Hz or higher, rVAD pitch mode performed at almost the same level as a state-of-the-art supervised VAD trained on a similar data set. Furthermore, it performed as well as a modern unsupervised VAD at sample frequencies of 2000 Hz or higher. At sample rates of 1250 Hz or lower, no VAD performed at the level of a state-of-the-art VAD. Regarding privacy, human ears were observed to detect speech better than computers, and humans can understand part or all of the spoken content at 2000 Hz or higher, which implies that current technology is not yet sufficient to detect speech from downsampled, privacy-preserving audio.
However, further research is still needed to verify the effects of the training set and its sample frequencies on the supervised method, along with properly designed scientific social experiments to test how well humans detect speech in audio with a reduced sample rate.
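
As a rough sanity check on the sample rates named above, the sketch below computes the Nyquist frequency (half the sample rate) for each one and reports how much of two illustrative speech-related frequency bands would remain representable. The band edges are approximate assumptions chosen for illustration, not values taken from this paper, and the function names are hypothetical. This is not the experimental pipeline, only a worked piece of arithmetic.

```python
# Sample rates named in the abstract, in Hz.
SAMPLE_RATES_HZ = [8000, 2000, 1250]

# Rough, illustrative frequency bands in Hz.
# ASSUMPTION: these band edges are approximate and are not from the paper.
ASSUMED_BANDS_HZ = {
    "adult pitch (F0)": (85, 255),
    "telephone band": (300, 3400),
}


def nyquist_hz(sample_rate_hz):
    """Return the highest frequency representable at this sample rate."""
    return sample_rate_hz / 2


def retained_fraction(band, sample_rate_hz):
    """Return the fraction of a (low, high) band lying below the Nyquist frequency."""
    low, high = band
    limit = nyquist_hz(sample_rate_hz)
    kept = max(0.0, min(high, limit) - low)
    return kept / (high - low)


def summarize():
    """Build one summary row per sample rate."""
    rows = []
    for rate in SAMPLE_RATES_HZ:
        row = {"rate_hz": rate, "nyquist_hz": nyquist_hz(rate)}
        for name, band in ASSUMED_BANDS_HZ.items():
            row[name] = round(retained_fraction(band, rate), 3)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Under these assumed bands, the pitch range stays entirely below the Nyquist frequency at all three rates, while most of the telephone band is lost at 2000 Hz and below. Real downsampling would also apply an anti-aliasing filter, which this arithmetic does not model.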