Signal-processing of audio for speech-recognition

Abstract

The transcription of voice using neural networks is a technique that deserves attention, as speech assistants are becoming increasingly popular. Neural networks often have difficulty distinguishing a talking person from noise. Humans are much better at this and could possibly apply their knowledge of the structure of the signals to improve the understanding of the neural network. A problem that is extremely difficult for a neural network is understanding and transcribing the lyrics of a song.

This thesis analyzes signal-processing techniques that can be applied to a song to improve the understanding of a speech-recognition algorithm. It focuses mainly on filtering the foreground lyrics from the accompaniment. Some basic filtering methods are described, including a low-amplitude filter and a band-pass filter. Two more complicated filters, which make use of the periodicity of the background music, are also treated.
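As an illustration of the two basic filters named above, the following is a minimal sketch and not the implementation attached to the thesis. It assumes that the band-pass filter zeroes all frequency bins outside a chosen band and that the low-amplitude filter silences samples below a threshold. The function names, the FFT-based approach, and all parameters are assumptions made for this example.

```python
import numpy as np

def band_pass(signal, sample_rate, low_hz, high_hz):
    """Keep only the frequency content between low_hz and high_hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    # Bins outside the band are set to zero before resynthesis.
    return np.fft.irfft(spectrum * keep, n=len(signal))

def low_amplitude_gate(signal, threshold):
    """Silence samples whose absolute amplitude is below threshold.

    Assumed reading of "low-amplitude filter"; the thesis may define it
    differently (for example on spectrogram components).
    """
    return np.where(np.abs(signal) >= threshold, signal, 0.0)
```

A band of roughly 300 to 3400 Hz is a common choice for speech, but which band suits sung lyrics is left open here.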
The first filter is a method of voice separation using the two-dimensional Fourier transform. This method, proposed by Prem Seetharaman, Fatemeh Pishdadian, and Bryan Pardo in 2017 [15], combines techniques from signal processing and image processing: it finds periodic repetitions in a signal by identifying peaks in the two-dimensional Fourier transform of the signal's spectrogram.
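The idea of picking peaks in the two-dimensional Fourier transform of a spectrogram can be sketched as follows. This is a simplified illustration and not the exact procedure of [15]: the neighborhood shape, the peak rule (largest bin in its neighborhood and above the mean plus one standard deviation), and the function names are assumptions. The input is assumed to be a magnitude spectrogram with frequency along the rows and time along the columns.

```python
import numpy as np

def running_max_along_time_axis(values, half_width):
    """Circular running maximum over 2 * half_width + 1 bins along axis 1."""
    shifts = range(-half_width, half_width + 1)
    return np.max([np.roll(values, s, axis=1) for s in shifts], axis=0)

def split_by_2dft_peaks(magnitude_spectrogram, half_width=2):
    """Split a spectrogram into a repeating and a non-repeating estimate."""
    transform = np.fft.fft2(magnitude_spectrogram)
    magnitude = np.abs(transform)
    # Assumed peak rule: a bin is a peak if it is the largest in its
    # neighborhood and clearly exceeds the overall level of the transform.
    threshold = magnitude.mean() + magnitude.std()
    local_max = running_max_along_time_axis(magnitude, half_width)
    is_peak = (magnitude >= local_max) & (magnitude > threshold)
    # Peaks are attributed to the periodic background, the rest to the foreground.
    background = np.real(np.fft.ifft2(transform * is_peak))
    foreground = np.real(np.fft.ifft2(transform * ~is_peak))
    return background, foreground
```

Because the two masks are complementary, the two estimates always sum to the input spectrogram. Turning them into a time-frequency mask for the original signal is a further step that this sketch leaves out.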
The second filter is a newly proposed method for separating the foreground from the background music. The algorithm compares columns in the spectrogram and classifies a column as overlapping if multiple columns similar to it occur elsewhere (repetitions). The frequency components of overlapping columns, that is, the individual frequencies obtained from a discrete short-time Fourier transform, are then compared with components of the same frequency in other columns. Under certain circumstances, overlapping frequency components are subtracted from components in other columns of the spectrogram. This removes repetitions of that frequency throughout the song. The components of the spectrogram that remain after several iterations of this method are most likely to correspond to the least repetitive parts of the song.
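A single pass of a column-comparison scheme of this kind might look like the sketch below. It is a hypothetical illustration only, since the abstract does not specify the similarity measure or the subtraction rule. Here cosine similarity, the thresholds `min_similarity` and `min_repeats`, and subtraction of the per-frequency minimum shared with the similar columns are all assumptions, not the thesis's actual choices.

```python
import numpy as np

def remove_repeated_columns(spectrogram, min_similarity=0.9, min_repeats=2):
    """One pass: subtract from each repeated column the part it shares
    with its similar columns (frequency rows, time columns)."""
    spec = np.asarray(spectrogram, dtype=float)
    norms = np.linalg.norm(spec, axis=0)
    norms[norms == 0.0] = 1.0           # avoid dividing by zero for silent columns
    unit = spec / norms
    similarity = unit.T @ unit          # cosine similarity between all column pairs
    np.fill_diagonal(similarity, 0.0)   # a column is not its own repetition
    is_similar = similarity >= min_similarity
    residual = spec.copy()
    # Columns with enough look-alikes are treated as overlapping (repeated).
    for col in np.flatnonzero(is_similar.sum(axis=1) >= min_repeats):
        matches = np.flatnonzero(is_similar[col])
        # Assumed rule: the shared part is the per-frequency minimum over
        # the column and all columns similar to it.
        shared = np.minimum(spec[:, col], spec[:, matches].min(axis=1))
        residual[:, col] = spec[:, col] - shared
    return residual
```

The method described above runs several iterations. In this sketch that would correspond to calling the function repeatedly on its own output, possibly with adjusted thresholds.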

The decisions made while constructing the method of comparing spectrogram columns are discussed and compared with the steps performed in the method that uses the two-dimensional Fourier transform. An implementation and demonstration are also attached. Based on the research, the two-dimensional Fourier transform method is expected to perform better on strictly periodic accompaniment, while the method that compares spectrogram columns is more likely to perform better on songs with a less tight rhythm.