W.B. Kleijn
15 records found
Mapping a room impulse response (RIR) to its Ambisonics representation is not always feasible. However, by adding a weak assumption (i.e., the existence of at least two perpendicular walls in the environment), the Ambisonics representation is restricted to be one of a finite set, with known transformations between the set entries. This makes mapping the omnidirectional RIR to the Ambisonics RIR (ARIR) possible. The authors solve the mapping problem with a convolutional neural network and a multi-task variational autoencoder. The room is assumed to be rectangular. The proposed method relies exclusively on the image source method with frequency-independent reflection coefficients. The authors focus on the early part of RIRs, where the directional information lies. The method requires only a single RIR. When generalized to real-world measurements, it can obviate the need for specialized hardware for Ambisonics measurement. The proposed method can achieve an SNR of 17.62 dB on estimated first-order ARIRs and 16.15 dB on estimated third-order ARIRs.
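The abstract above reports its results as an SNR in dB but does not state the exact definition used. As a minimal sketch, assuming the common definition (reference energy divided by the energy of the estimation error), the figure could be computed as follows. The function name `snr_db` and the toy signals are hypothetical and only serve as illustration.

```python
import math

def snr_db(reference, estimate):
    """SNR in dB, assuming SNR = reference energy / error energy."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have equal length")
    signal_energy = sum(r * r for r in reference)
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect estimate
    return 10.0 * math.log10(signal_energy / error_energy)

# Hypothetical toy data: a short decaying response and an estimate
# whose amplitude is 10% too small at every sample.
reference = [1.0, 0.5, 0.25, 0.125]
estimate = [0.9 * r for r in reference]
print(snr_db(reference, estimate))
```

With a uniform 10% amplitude error the error energy is 1% of the reference energy, so the result is approximately 20 dB.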
We describe a new method to estimate the geometry of a room and its reflection coefficients from room impulse responses (RIRs). The method uses convolutional neural networks to estimate the room geometry and multilayer perceptrons to estimate the reflection coefficients. The mean square error is used as the loss function. In contrast to existing methods, we do not require knowledge of the relative positions of sources and receivers in the room. The method can be used with only a single RIR between one source and one receiver. For simulated environments, the proposed estimation method can achieve an average accuracy of 0.04 m for each dimension in room geometry estimation and an accuracy of 0.09 in reflection coefficients. For real-world environments, the room geometry estimation method achieves an average accuracy of 0.065 m for each dimension.
In this paper, we present a novel derivation of an existing algorithm for distributed optimization termed the primal-dual method of multipliers (PDMM). In contrast to its initial derivation, monotone operator theory is used to connect PDMM with other first-order methods such as Douglas-Rachford splitting and the alternating direction method of multipliers, thus providing insight into its operation. In particular, we show how PDMM combines a lifted dual form with Peaceman-Rachford splitting to facilitate distributed optimization in undirected networks. We additionally demonstrate sufficient conditions for primal convergence for strongly convex differentiable functions and strengthen this result for strongly convex functions with Lipschitz continuous gradients by introducing a primal geometric convergence bound.
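To make the role of Peaceman-Rachford splitting more concrete, the sketch below applies it to a toy scalar problem, minimizing 0.5*(x - a)^2 + 0.5*(x - b)^2. This is not PDMM and not the lifted dual form from the paper; it is a centralized illustration only, and the function names and the step size `gamma` are assumptions made for the example.

```python
def prox_quadratic(z, center, gamma):
    """Proximal operator of gamma * f, with f(x) = 0.5 * (x - center)**2."""
    return (z + gamma * center) / (1.0 + gamma)

def reflect(z, center, gamma):
    """Reflected proximal operator: 2 * prox(z) - z."""
    return 2.0 * prox_quadratic(z, center, gamma) - z

def peaceman_rachford(a, b, gamma=0.5, iterations=50):
    """Minimize 0.5*(x - a)**2 + 0.5*(x - b)**2 by Peaceman-Rachford splitting."""
    z = 0.0
    for _ in range(iterations):
        # Compose the two reflections; for these strongly convex,
        # smooth terms the composition is a contraction.
        z = reflect(reflect(z, a, gamma), b, gamma)
    # Recover the primal variable from the fixed point.
    return prox_quadratic(z, a, gamma)

x_star = peaceman_rachford(1.0, 3.0)
print(x_star)
```

The minimizer of the toy objective is the midpoint (a + b) / 2, which the iteration approaches geometrically. This mirrors, in a much simpler setting, the geometric convergence the abstract describes for strongly convex functions with Lipschitz continuous gradients.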
We investigate what information about a room is necessary to integrate a new source into an existing scenario. In particular, we consider the effects of the reflection order, the order of the ambisonics signals, and the reverberation time. We conducted a series of listening tests and used the control variates method to determine the quantitative relevance of the selected attributes. In terms of integration and accurate localisation, at least a third-order ambisonics description of a source is required for integration of that source. In addition, a finite number of early reflections can perform as well as a full room impulse response when a new source is integrated into an existing scenario. However, a room impulse response with only the correct reverberation time is not sufficient.
We propose a monaural intrusive instrumental intelligibility metric called SIIB (speech intelligibility in bits). SIIB is an estimate of the amount of information shared between a talker and a listener in bits per second. Unlike existing information theoretic intelligibility metrics, SIIB accounts for talker variability and statistical dependencies between time-frequency units. Our evaluation shows that relative to state-of-the-art intelligibility metrics, SIIB is highly correlated with the intelligibility of speech that has been degraded by noise and processed by speech enhancement algorithms.
Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and sEPSMcorr. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top performing metrics have high performance. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, preprocessing enhancement, and postprocessing enhancement. SIIB and HASPI had the highest performance, achieving average correlations with listening test scores of ρ = 0.92 and ρ = 0.89, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on datasets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, this paper presents a new version of SIIB called SIIBGauss, which has similar performance to SIIB and HASPI but takes two orders of magnitude less time to compute.
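The abstract does not say which correlation coefficient ρ denotes or whether the metric outputs are mapped before correlating. As a rough sketch, assuming a plain Pearson correlation between metric outputs and listening test scores, the comparison could look like this. The function name `pearson` and all numbers are hypothetical and are not data from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: metric outputs and listening test scores
# (fraction of words correct) for five conditions.
metric_scores = [10.0, 25.0, 40.0, 60.0, 90.0]
listening_scores = [0.15, 0.35, 0.55, 0.80, 0.95]
rho = pearson(metric_scores, listening_scores)
print(rho)
```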
Speech intelligibility enhancement is considered for multiple-microphone acquisition and single-loudspeaker rendering. The enhancement is based on the mutual information between the message spoken in the far-end environment and the message perceived by a listener at the near end. We prove that the joint optimal processing can be decomposed into far-end and near-end processing. The former is a minimum variance distortionless response beamformer that reduces the noise in the talker environment, and the latter is a post-filter that redistributes the power over the frequency bands. Disjoint processing is optimal provided that the post-filtering operation is aware of the residual noise from the beamforming operation. Our results show that both processing steps are necessary for the effective conveyance of a message and, importantly, that the second step must be aware of the remaining noise from the beamforming operation in the first step. In addition, we study the use of the mutual information applied to the perceptually more relevant powers per critical band.
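For readers unfamiliar with the far-end stage, the sketch below computes minimum variance distortionless response weights, w = R^{-1} d / (d^H R^{-1} d), for a single frequency bin and two microphones. It covers only the textbook beamformer, not the post-filter or the joint optimization from the paper. The noise covariance `R` and steering vector `d` are hypothetical values, and the two-microphone restriction is an assumption made to keep the matrix inverse in closed form.

```python
def mvdr_weights_2mic(R, d):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for two microphones.

    R is a 2x2 Hermitian noise covariance (nested lists of complex),
    d is the length-2 steering vector toward the talker.
    """
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    if abs(det) < 1e-12:
        raise ValueError("noise covariance is singular")
    # u = R^{-1} d, using the closed-form inverse of a 2x2 matrix.
    u = [(R[1][1] * d[0] - R[0][1] * d[1]) / det,
         (-R[1][0] * d[0] + R[0][0] * d[1]) / det]
    # Normalizer d^H R^{-1} d.
    denom = d[0].conjugate() * u[0] + d[1].conjugate() * u[1]
    return [u[0] / denom, u[1] / denom]

def apply_beamformer(w, x):
    """Beamformer output w^H x for one frequency bin."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))

# Hypothetical single-bin values.
R = [[1.0 + 0j, 0.3 + 0.1j],
     [0.3 - 0.1j, 1.0 + 0j]]
d = [1.0 + 0j, 0.5 - 0.5j]
w = mvdr_weights_2mic(R, d)
gain = apply_beamformer(w, d)  # distortionless constraint: w^H d = 1
print(gain)
```

The check `w^H d = 1` is the distortionless constraint: the target direction passes with unit gain while the noise power at the output is minimized.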
This paper addresses the problem of joint wideband localization and acquisition of acoustic sources. The source locations and the original source signals are obtained jointly by solving a sparse recovery problem. Spatial sparsity is enforced by discretizing the acoustic scene into a grid of predefined dimensions. In practice, energy leakage from the source location to the neighboring grid points is expected to produce spurious location estimates, since the source location will not coincide with one of the grid points. To alleviate this problem, we introduce the concept of grid-shift. A particular source is then near a point on the grid in at least one of a set of shifted grids. For the selected grid, other sources will generally not be on a grid point, but their energy is distributed over many points. A large number of experiments on real speech signals show the localization and acquisition effectiveness of the proposed approach under clean, noisy, and reverberant conditions.
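The geometric idea behind grid-shift can be sketched in one dimension: among several uniformly shifted copies of a grid, at least one has a point close to any given source position. In the paper the source position is unknown and the grid is selected through sparse recovery; the sketch below assumes a known position purely to illustrate the geometry. The function names, the 1-D setting, and the uniform shift pattern are all assumptions.

```python
def distance_to_grid(position, spacing, offset):
    """Distance from position to the nearest point of a shifted 1-D grid."""
    r = (position - offset) % spacing
    return min(r, spacing - r)

def best_shifted_grid(position, spacing, num_shifts):
    """Return (distance, offset) for the shifted grid closest to position."""
    offsets = [k * spacing / num_shifts for k in range(num_shifts)]
    return min((distance_to_grid(position, spacing, o), o) for o in offsets)

# Hypothetical example: unit grid spacing, four shifted grids,
# source at 0.37 (between grid points of the unshifted grid).
distance, offset = best_shifted_grid(0.37, 1.0, 4)
print(distance, offset)
```

With `num_shifts` uniformly shifted grids, the worst-case distance from a source to the nearest grid point drops from `spacing / 2` to `spacing / (2 * num_shifts)`, which is why a source is close to a grid point on at least one of the shifted grids.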
...