Room geometry estimation from stereo recordings using neural networks

Abstract

Acoustic room geometry estimation is often performed in ad hoc settings, i.e., using multiple microphones and sources distributed around the room, or assuming control over the excitation signals. To facilitate practical applications, we propose a fully convolutional network (FCN) that localizes reflective surfaces under the relaxed assumptions that (i) a compact array of only two microphones is available, (ii) emitter and receivers are not synchronized, and (iii) both the excitation signals and the impulse responses of the enclosures are unknown.
Our FCN is designed to extract spectral and temporal patterns from stereo recordings, aggregate the temporal information over time-frames, and predict the likelihood of virtual sources corresponding to reflective surfaces at specific locations. Whereas most source localization algorithms are limited to direction-of-arrival (DOA) estimation, the proposed method jointly estimates distances and DOAs. Numerical experiments confirm that the network is able to generalize to mismatched microphone array sizes, sensor directivity patterns, or audio signal types, while highlighting front-back ambiguity as a prominent source of uncertainty. When a single reflective surface is present, up to 80% of the sources are detected, while this figure approaches 50% in rectangular rooms.
Further tests on real-world recordings yield accuracy similar to that obtained with artificially reverberated speech signals, validating the generalization capabilities of the framework.
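
To make the notions of virtual source, joint distance/DOA description, and front-back ambiguity concrete, the following is a minimal geometric sketch and not the proposed network. It assumes a 2-D free-field setting, a two-microphone array on the x-axis, a speed of sound of 343 m/s, and the standard image-source construction for a flat wall; all function names, positions, and the wall placement are hypothetical choices made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def image_source(source, wall_point, wall_normal):
    """Mirror a source across a flat wall (image-source construction)."""
    n = wall_normal / np.linalg.norm(wall_normal)
    return source - 2.0 * np.dot(source - wall_point, n) * n


def distance_and_doa(position, array_center):
    """Distance (m) and direction of arrival (rad) seen from the array center."""
    offset = position - array_center
    return np.linalg.norm(offset), np.arctan2(offset[1], offset[0])


def tdoa(position, mic_a, mic_b):
    """Time difference of arrival (s) between the two microphones."""
    path_difference = np.linalg.norm(position - mic_a) - np.linalg.norm(position - mic_b)
    return path_difference / SPEED_OF_SOUND


# Hypothetical setup: 10 cm microphone spacing, wall along the line x = 3 m.
mic_a = np.array([-0.05, 0.0])
mic_b = np.array([0.05, 0.0])
array_center = 0.5 * (mic_a + mic_b)
source = np.array([1.0, 2.0])

# The wall reflection behaves like a virtual source behind the wall.
virtual = image_source(source, np.array([3.0, 0.0]), np.array([1.0, 0.0]))
virtual_distance, virtual_doa = distance_and_doa(virtual, array_center)

# Mirroring the virtual source about the array axis leaves the TDOA unchanged,
# which is the front-back ambiguity of a two-microphone array.
mirrored = virtual * np.array([1.0, -1.0])
tdoa_front = tdoa(virtual, mic_a, mic_b)
tdoa_back = tdoa(mirrored, mic_a, mic_b)

print("virtual source position:", virtual)
print("distance (m), DOA (deg):", virtual_distance, np.degrees(virtual_doa))
print("TDOA front vs. back (s):", tdoa_front, tdoa_back)
```

In this sketch, the two mirrored positions share the same distance and TDOA and differ only in the sign of the DOA, so inter-microphone delay alone cannot separate them.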