The effects on speech detection of low sample frequency audio data

Bachelor Thesis (2022)

Author(s)

T. Uno (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.D. Vargas Quiros – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.A. Baaijens – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Speech detection, Low sample frequency audio, Voice Activity Detection

To reference this document use

https://resolver.tudelft.nl/uuid:80ac9f4d-3bdb-4374-9227-343aee94356f

Publication Year

2022

Language

English

Graduation Date

24-06-2022

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Collections

thesis

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Interactions between humans and machines are now common in daily life. Audio of human communication is a rich source of information, but it is considered privacy-invasive for machines to listen to it. Reducing the sampling frequency can preserve privacy by making conversation unintelligible while still making it possible to detect whether someone is speaking. This paper investigates how audio sampled at low frequencies hinders speech detection. To detect speaking, voice activity detection was applied, a signal processing technique that identifies which short segments of audio contain speech. Two types of state-of-the-art voice activity detector (VAD) were used in the experiment: one supervised method (pyannote) and two unsupervised methods (rVAD in pitch mode and in flatness mode). The unsupervised methods outperformed the supervised model, with rVAD pitch mode performing best of the three. More specifically, the performance of the unsupervised VADs dropped as the sample rate decreased, while the supervised VAD did not work well at higher sample frequencies. At sample rates of 8000 Hz or higher, rVAD pitch mode performed at almost the same level as a state-of-the-art supervised VAD trained on a similar data set. Furthermore, it performed as well as a modern unsupervised VAD at sample frequencies of 2000 Hz or higher. At sample rates of 1250 Hz or lower, no VAD performed at the level of a state-of-the-art VAD. Regarding privacy, human ears were observed to detect speaking better than computers, and humans can understand part or all of the spoken content at 2000 Hz or higher, which implies that current technology is not sufficient to detect speech from downsampled, privacy-preserving audio.
However, further research is still needed to verify the effects of the training set and its sample frequencies on the supervised method, as well as proper scientific social experiments to test humans' ability to detect speech in audio with a reduced sample rate.
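
As a rough illustration of the downsampling idea described in the abstract, the sketch below computes the Nyquist limit (half the sample rate, the highest frequency a sampled signal can represent) and reduces the sample rate of a synthetic tone. It is not the thesis's method: the function names, the 16000 Hz source rate, the 200 Hz test tone, and the block-averaging filter are all assumptions chosen for a short, dependency-free example. A real pipeline would normally use a proper anti-aliasing resampler.

```python
import math


def nyquist_hz(sample_rate_hz):
    """Highest frequency representable at the given sample rate."""
    return sample_rate_hz / 2.0


def sine_tone(freq_hz, sample_rate_hz, seconds):
    """Synthetic test signal (assumption: stands in for recorded audio)."""
    count = int(sample_rate_hz * seconds)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]


def downsample(samples, source_rate_hz, target_rate_hz):
    """Crude integer-factor downsampling.

    Each block of `factor` samples is averaged into one output sample.
    The averaging acts as a simple (and imperfect) low-pass filter.
    """
    if target_rate_hz <= 0 or source_rate_hz % target_rate_hz != 0:
        raise ValueError("target rate must evenly divide the source rate")
    factor = source_rate_hz // target_rate_hz
    last_start = len(samples) - factor + 1
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, last_start, factor)]


if __name__ == "__main__":
    source_rate = 16000  # assumed recording rate, not taken from the thesis
    signal = sine_tone(200.0, source_rate, 1.0)
    for target_rate in (8000, 2000):  # two of the rates named in the abstract
        reduced = downsample(signal, source_rate, target_rate)
        print(target_rate, "Hz ->", len(reduced), "samples, Nyquist limit",
              nyquist_hz(target_rate), "Hz")
```

At a 2000 Hz sample rate the Nyquist limit is 1000 Hz, so any content above that is lost or aliased, which is the mechanism by which lowering the sample rate removes detail from speech. The 1250 Hz rate mentioned in the abstract does not divide 16000 evenly, so this integer-factor sketch rejects it; handling it would require a rational resampler.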

Files

FINAL_Research_Paper.pdf

(pdf | 0.361 Mb)

License info not available