Advances in DFT-Based Single-Microphone Speech Enhancement

Abstract

Interest in speech enhancement stems from the increased use of digital speech processing applications such as mobile telephony, digital hearing aids and human-machine communication systems in daily life. The trend to make these applications mobile increases the variety of potential sources of quality degradation. Speech enhancement methods can be used to increase the quality of these speech processing devices and make them more robust under noisy conditions. The term "speech enhancement" refers to a large group of methods that are all meant to improve certain quality aspects of these devices. Examples of speech enhancement algorithms are echo control, bandwidth extension, packet loss concealment and noise reduction. In this thesis we focus on single-microphone additive noise reduction and aim at methods that work in the discrete Fourier transform (DFT) domain. The main objective of the presented research is to improve on existing single-microphone schemes for an extended range of noise types and noise levels, thereby making these methods more suitable for mobile speech communication applications than state-of-the-art algorithms. The research in this thesis covers three topics. The first topic is improved estimation of the a priori signal-to-noise ratio (SNR) from the noisy speech, of which we address two aspects. Firstly, we present an adaptive time-segmentation algorithm, which we use to reduce the variance of the estimated a priori SNR. Secondly, we present an approach to reduce the bias of the estimated a priori SNR, which is often present during transitions between speech sounds. The second topic is the derivation of clean speech estimators under models that take properties of speech into account. This problem is approached from two different angles. In the first approach, we derive clean speech estimators under a combined stochastic/deterministic model for the complex DFT coefficients.
The use of a deterministic model is motivated by the fact that certain speech sounds have a more deterministic character. In the second approach, we derive complex DFT and magnitude DFT estimators under super-Gaussian densities. The use of these densities is motivated by measured histograms of speech DFT coefficients. We present two different types of estimators under super-Gaussian densities. Minimum mean-square error (MMSE) estimators are derived under a generalized Gamma density for the clean speech DFT coefficients and DFT magnitudes. Maximum a posteriori (MAP) estimators are derived under the multivariate normal inverse Gaussian (MNIG) density for the clean speech DFT coefficients. Estimators derived under the MNIG density have some theoretical advantages over estimators derived under the generalized Gamma density. More specifically, under the MNIG density the statistical models in the complex DFT domain and the polar domain are consistent, which is not the case for estimators derived under the generalized Gamma density. In addition, the MNIG density can model vector processes, which makes it possible to take into account the dependency between the real and imaginary parts of DFT coefficients. Finally, we present a method for tracking the noise power spectral density (PSD). The method is based on the eigenvalue decomposition of correlation matrices that are constructed from time series of noisy DFT coefficients. In contrast to existing methods, this approach makes it possible to update the noise PSD when speech is continuously present. Furthermore, the tracking delay is considerably reduced compared to state-of-the-art noise tracking algorithms. A comparison is performed between a combination of individual components presented in this thesis and a state-of-the-art speech enhancement system from the literature.
A subjective listening test shows that the system based on the contributions of this thesis improves significantly over the state-of-the-art speech enhancement system.
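
As a point of reference for the a priori SNR estimation discussed above, the following is a minimal sketch of a conventional DFT-domain noise reduction step: the widely used decision-directed a priori SNR estimate combined with a Wiener gain. It is not the method developed in this thesis. The function names `enhance_frames` and `demo`, the smoothing factor of 0.98, the SNR floor of -25 dB, the synthetic test signal, and the assumption that the noise PSD is known in advance are all illustrative choices.

```python
import numpy as np


def enhance_frames(noisy_spec, noise_psd, alpha=0.98, xi_min=10 ** (-25 / 10)):
    """Wiener gain driven by a decision-directed a priori SNR estimate.

    noisy_spec: complex array (frames, bins) of noisy DFT coefficients.
    noise_psd:  real array (bins,), noise PSD per bin (assumed known here).
    alpha:      smoothing factor of the decision-directed estimate (assumed).
    xi_min:     lower bound on the a priori SNR (assumed, -25 dB).
    """
    n_frames, n_bins = noisy_spec.shape
    enhanced = np.zeros_like(noisy_spec)
    prev_clean_power = np.zeros(n_bins)  # |estimated clean coefficient|^2 of previous frame

    for t in range(n_frames):
        # A posteriori SNR: noisy periodogram divided by the noise PSD.
        gamma = np.abs(noisy_spec[t]) ** 2 / noise_psd
        # Decision-directed a priori SNR: weighted sum of the previous
        # clean-speech estimate and the instantaneous estimate max(gamma - 1, 0).
        xi = alpha * prev_clean_power / noise_psd \
            + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
        xi = np.maximum(xi, xi_min)
        # Wiener gain, applied to the complex noisy coefficient.
        gain = xi / (1.0 + xi)
        enhanced[t] = gain * noisy_spec[t]
        prev_clean_power = np.abs(enhanced[t]) ** 2

    return enhanced


def demo(seed=0):
    """Synthetic check: a constant component in one bin plus white noise."""
    rng = np.random.default_rng(seed)
    n_frames, n_bins = 200, 8
    noise_psd = np.ones(n_bins)
    # Circular complex Gaussian noise with variance noise_psd per bin.
    noise = np.sqrt(noise_psd / 2.0) * (
        rng.standard_normal((n_frames, n_bins))
        + 1j * rng.standard_normal((n_frames, n_bins))
    )
    clean = np.zeros((n_frames, n_bins), dtype=complex)
    clean[50:150, 2] = 10.0  # "speech" present only in bin 2, frames 50..149
    noisy = clean + noise
    enhanced = enhance_frames(noisy, noise_psd)
    err_before = np.mean(np.abs(noisy - clean) ** 2)
    err_after = np.mean(np.abs(enhanced - clean) ** 2)
    return err_before, err_after


if __name__ == "__main__":
    before, after = demo()
    print(f"mean squared error before: {before:.3f}, after: {after:.3f}")
```

In this baseline the recursive smoothing lowers the variance of the a priori SNR estimate, but it also makes the estimate lag behind at the onset and offset of the component in bin 2. These are the variance and bias issues that the first research topic of the thesis addresses.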