Multi-Microphone Signal Parameter Estimation in Various Acoustic Scenarios

Doctoral Thesis (2025)
Author(s)

Changheng Li (TU Delft - Signal Processing Systems)

Contributor(s)

A. J. van der Veen – Promotor (TU Delft - Signal Processing Systems)

R. C. Hendriks – Promotor (TU Delft - Signal Processing Systems)

Research Group
Signal Processing Systems
Publication Year
2025
Language
English
ISBN (print)
978-94-6518-068-7
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Many modern devices, such as mobile phones, hearing aids and (hands-free) acoustic human-machine interfaces, are equipped with microphone arrays that can be used for various applications, including source separation, audio quality enhancement, speech intelligibility improvement and source localization. In an ideal anechoic chamber, the signals received by ideal microphones are merely attenuated and delayed versions of the original sound. In practice, however, obstacles such as the floor, the ceiling and the surrounding walls reflect the sound towards the microphones. In addition, the microphones themselves generate noise that distorts the recorded signals. Lastly, multiple point sources may be active simultaneously; when one point source is considered the target signal, the other sources act as interfering signals. These distortions make it difficult to access the target signal. Therefore, spatial filtering is often applied to the microphone signals.
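As a toy illustration of the anechoic model and of spatial filtering, the sketch below simulates microphone signals as attenuated, delayed copies of a source plus sensor self-noise, and applies a simple delay-and-sum beamformer. All numbers (delays, gains, noise level) are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                      # sample rate in Hz (illustrative)
s = rng.standard_normal(fs)     # 1 s of a surrogate source signal

# Anechoic model sketch: each microphone receives an attenuated,
# integer-sample-delayed copy of the source plus sensor self-noise.
delays = [0, 3, 5]              # per-mic delays in samples (assumed)
gains = [1.0, 0.8, 0.6]         # per-mic attenuation (assumed)
x = np.stack([g * np.roll(s, d) for g, d in zip(gains, delays)])
x += 0.05 * rng.standard_normal(x.shape)   # microphone self-noise

# Delay-and-sum beamforming: undo the delays, rescale, and average.
y = np.mean([np.roll(xm, -d) / g
             for xm, d, g in zip(x, delays, gains)], axis=0)

# Averaging over mics suppresses the uncorrelated sensor noise, so the
# beamformer output is closer to s than any single microphone signal.
err_single = np.mean((x[0] - s) ** 2)
err_beam = np.mean((y - s) ** 2)
assert err_beam < err_single
```

Even this simplest spatial filter already needs scene parameters (here the delays and gains); the reverberant, noisy scenes treated in the thesis require the richer parameter set discussed next.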

To achieve satisfactory performance, these spatial filters typically need to adapt to the (changing) acoustic scene. Specifically, the filter coefficients depend on the acoustic-scene-related parameters that model the microphone signals. These parameters, such as the relative transfer functions (RTFs) of the sources and the power spectral densities (PSDs) of the sources, the late reverberation and the ambient noise, are typically unknown in practice. Estimating these parameters is therefore crucial, and forms the main focus of this dissertation. While it is relatively straightforward to estimate these parameters in less complex acoustic scenes, the resulting algorithms are usually neither applicable nor extendable to more complex scenes. The complexity of the required estimation methods therefore grows with the complexity of the acoustic scene.
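To make the dependence of the filter coefficients on these parameters concrete, the following minimal sketch computes a standard MVDR beamformer for one frequency bin from an assumed RTF vector and an estimated noise covariance matrix. The values are arbitrary illustrations; the thesis does not prescribe this particular filter.

```python
import numpy as np

# One-frequency-bin MVDR sketch: the filter coefficients w are a
# function of the scene parameters, namely the target RTF vector `a`
# and the noise(+late reverberation) covariance `Rn`. Both are
# illustrative assumptions here; in practice they must be estimated.
M = 4                                        # number of microphones
rng = np.random.default_rng(1)
a = np.exp(1j * rng.uniform(0, 2 * np.pi, M))  # surrogate RTF vector
a[0] = 1.0                                   # RTF referenced to mic 0
N = rng.standard_normal((M, 100)) + 1j * rng.standard_normal((M, 100))
Rn = (N @ N.conj().T) / 100                  # estimated noise covariance

w = np.linalg.solve(Rn, a)                   # Rn^-1 a
w = w / (a.conj() @ w)                       # MVDR: w = Rn^-1 a / (a^H Rn^-1 a)

# Distortionless constraint: a signal propagating with RTF `a`
# passes through the filter with unit gain.
assert abs(w.conj() @ a - 1.0) < 1e-8
```

If `a` or `Rn` is estimated poorly, the constraint is enforced for the wrong steering direction, which is precisely why accurate parameter estimation matters for the filtering performance.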

In Chapter 3, we consider the simplest acoustic scene in this dissertation: a single source in a reverberant but noiseless environment. The parameters we aim to estimate are the RTFs, the PSDs of the target signal and the PSDs of the late reverberation. We first propose a joint estimator that uses a single time frame and has a closed-form solution. We then propose a joint estimator that uses multiple time frames sharing the same RTF, where each iteration step has a closed-form solution. The parameter estimation accuracy, as well as the resulting noise reduction, speech quality and speech intelligibility, is compared to various state-of-the-art reference methods. The experiments demonstrate that the proposed method reduces computational costs and improves performance.

Next, we extend the noiseless signal model of Chapter 3 to the noisy model of Chapters 4 and 5. In Chapter 4, we focus on RTF estimation and propose an estimator that is robust to errors in the late reverberation and noise PSDs. This is achieved by using only the off-diagonal elements of a simplified covariance matrix. The experiments demonstrate the effectiveness of the proposed method. In Chapter 5, we propose a joint estimator of the RTFs, the PSDs of the source, the PSDs of the late reverberation and the PSDs of the ambient noise, both when using a single time frame and when using multiple time frames that share the same RTF.
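The intuition behind using off-diagonal covariance entries can be illustrated with a toy model. The sketch below is not the Chapter 4 estimator itself; it only shows that when the sensor noise is spatially uncorrelated, the noise biases only the diagonal of the covariance matrix, so RTF ratios formed from off-diagonal entries remain exact. All numbers are assumed.

```python
import numpy as np

# Toy model: x = a*s + n, with noise uncorrelated across microphones,
# so Rx = phi_s * a a^H + sigma2 * I. Only the diagonal carries the
# sigma2 bias; off-diagonal ratios recover the RTF exactly.
M = 4
phi_s = 2.0                                # source PSD (assumed)
sigma2 = 0.5                               # sensor-noise PSD (assumed)
rng = np.random.default_rng(2)
a = rng.standard_normal(M) + 1j * rng.standard_normal(M)
a = a / a[0]                               # RTF referenced to mic 0 (a[0] = 1)
Rx = phi_s * np.outer(a, a.conj()) + sigma2 * np.eye(M)

# Estimate using the (noise-contaminated) diagonal: biased,
# because the denominator is phi_s + sigma2 instead of phi_s.
a_diag = Rx[:, 0] / Rx[0, 0]

# Off-diagonal-only estimate: for mic i, pick a reference column j
# with j != i and j != 0, so neither entry touches the diagonal:
# Rx[i, j] / Rx[0, j] = (phi_s a_i a_j^*) / (phi_s a_j^*) = a_i.
a_off = np.array([1.0 + 0j] +
                 [Rx[i, 1 if i != 1 else 2] / Rx[0, 1 if i != 1 else 2]
                  for i in range(1, M)])

assert not np.allclose(a_diag, a)          # diagonal bias corrupts the RTF
assert np.allclose(a_off, a)               # off-diagonal ratios are exact
```

In practice `Rx` must itself be estimated from data and the late reverberation adds further structure, which is what the Chapter 4 estimator addresses.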

Beyond the acoustic scene of a single point source, in Chapters 6 and 7 we consider scenarios with multiple point sources. In Chapter 6, we first consider the case where the environment is approximately non-reverberant and noiseless. Under this assumption, we propose a method to estimate the RTFs. Satisfactory estimates are obtained by averaging covariance matrices over as many time frames as possible without suffering too much from the model mismatch errors caused by distortion signals. This method is based on comparing several estimates obtained from differently averaged covariance matrices; it is somewhat heuristically motivated and unsatisfactory in reverberant and noisy environments. Therefore, in Chapter 7, we propose a robust method that works in reverberant and noisy environments and estimates not only the RTFs but also the PSDs of the sources and the late reverberation.
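The trade-off behind the averaging strategy can be shown in a toy simulation (assumed setup, not the Chapter 6 method): averaging per-frame sample covariances over more frames reduces estimation variance, but once the scene changes (here, the steering vector of the source), the extra frames introduce model-mismatch bias that dominates.

```python
import numpy as np

# Toy covariance-averaging trade-off in a near-anechoic setting.
# Steering vectors and noise level below are illustrative assumptions.
rng = np.random.default_rng(4)
M, T = 3, 50
a_before = np.array([1.0, 0.5, -0.5])      # steering before the change
a_after = np.array([1.0, -0.8, 0.3])       # steering after the change

def frame_cov(a):
    """Sample covariance of one frame: rank-1 source plus weak noise."""
    s = rng.standard_normal(T)
    x = np.outer(a, s) + 0.1 * rng.standard_normal((M, T))
    return x @ x.T / T

covs = ([frame_cov(a_before) for _ in range(8)] +
        [frame_cov(a_after) for _ in range(8)])
R_true = np.outer(a_before, a_before)      # target: pre-change covariance

def err(k):
    """Estimation error after averaging the first k frame covariances."""
    return np.linalg.norm(np.mean(covs[:k], axis=0) - R_true)

# Averaging the 8 static frames keeps the error small, but folding in
# the 8 post-change frames adds a mismatch bias that dominates.
assert err(8) < 1.0
assert err(16) > err(8)
```

Choosing how many frames to average, without knowing when the scene changed, is exactly the difficulty that motivates the comparison-based heuristic of Chapter 6 and its robust successor in Chapter 7.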

Most of the methods introduced above exploit the prior information that several consecutive time frames share the same RTF. However, this assumption holds only if the source remains at the same position during these time frames. In Chapter 8, we therefore propose a method that adaptively segments the signals into segments within which the source can be considered static. The proposed method is combined with the estimator of Chapter 3 to estimate the parameters of a single non-static source. The experiments show that the proposed adaptive time segmentation improves the estimation performance compared with a fixed time segmentation.
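A crude stand-in for such adaptive segmentation (a toy sketch under assumed thresholds, not the Chapter 8 algorithm) is to start a new segment whenever the per-frame spatial statistics jump:

```python
import numpy as np

def segment_static(frames, threshold=0.8):
    """Toy change detector: group frames into runs with similar
    sample covariance. frames: list of (M, T) arrays; returns a
    list of segments, each a list of frame indices."""
    covs = [f @ f.conj().T / f.shape[1] for f in frames]
    segments = [[0]]
    for k in range(1, len(frames)):
        change = (np.linalg.norm(covs[k] - covs[k - 1])
                  / np.linalg.norm(covs[k - 1]))
        if change > threshold:
            segments.append([k])       # statistics jumped: new segment
        else:
            segments[-1].append(k)     # still static: extend segment
    return segments

# Synthetic check: the source "moves" (its steering vector changes)
# after frame 4, so two static segments should be detected.
rng = np.random.default_rng(3)
a1 = np.array([1.0, 1.0, 1.0])
a2 = np.array([1.0, -1.0, 1.0])
frames = [np.outer(a1 if k < 5 else a2, rng.standard_normal(200))
          for k in range(10)]
segs = segment_static(frames)
assert segs == [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```

A fixed-length segmentation would either cut static stretches too early (losing frames that could be averaged) or straddle the change point (mixing two RTFs); detecting the change adaptively avoids both, which mirrors the motivation for Chapter 8.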