SonicVision: Acoustic Object Detection for Autonomous Driving

Master Thesis (2025)
Author(s)

Z. Liu (TU Delft - Mechanical Engineering)

Contributor(s)

S. Wang – Mentor (TU Delft - Intelligent Vehicles)

J.F.P. Kooij – Mentor (TU Delft - Intelligent Vehicles)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Graduation Date
26-09-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Autonomous driving relies heavily on cameras and LiDAR for 3D perception, yet these vision-based sensors face limitations under poor illumination, adverse weather, or occlusion. Inspired by human hearing, we explore whether microphone arrays can enhance vehicle perception. We propose SonicVision, the first bird’s-eye-view (BEV) acoustic detection framework that jointly localizes and classifies traffic participants using sound alone. Our method employs a horizontally arranged 32-channel microphone array and transforms the raw waveforms into short-time Fourier transform (STFT) features augmented with positional embeddings. A ResNet-based architecture is trained with novel Gaussian label representations to predict class-conditioned direction–distance distributions. To support this study, we collect three datasets (simulation, test track, and real road) with synchronized audio and LiDAR, where LiDAR detections serve as pseudo-labels. Experiments show that SonicVision significantly outperforms beamforming-based baselines, achieving accurate localization and reliable classification. In some cases, our approach identifies objects that LiDAR misses, suggesting its potential as both an independent sensor and a complementary modality. These results provide the first evidence that low-cost microphone arrays can meaningfully contribute to 3D perception for autonomous vehicles.
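
The feature pipeline described in the abstract can be illustrated with a short sketch. The code below is not the thesis implementation: the sample rate, STFT parameters, log-magnitude compression, and the linear-index positional embedding are all assumptions chosen for illustration. Only the 32-channel horizontal array and the idea of STFT features augmented with positional embeddings come from the abstract.

import numpy as np
from scipy.signal import stft

NUM_CHANNELS = 32          # horizontal microphone array (from the abstract)
SAMPLE_RATE = 48_000       # assumption; not stated in the abstract
N_FFT, HOP = 1024, 512     # assumed STFT parameters

def extract_features(waveforms):
    """waveforms: (32, num_samples) raw audio -> (64, freq, time) features."""
    specs = []
    for ch in range(NUM_CHANNELS):
        _, _, Z = stft(waveforms[ch], fs=SAMPLE_RATE,
                       nperseg=N_FFT, noverlap=N_FFT - HOP)
        specs.append(np.log1p(np.abs(Z)))        # log-magnitude spectrogram
    feats = np.stack(specs)                      # (32, freq_bins, time_frames)
    # One plausible positional embedding: for each microphone, a constant
    # map holding its normalized position along the horizontal array,
    # concatenated with the spectrograms so the network can tell channels apart.
    pos = np.linspace(-1.0, 1.0, NUM_CHANNELS)
    pos_maps = np.broadcast_to(pos[:, None, None], feats.shape)
    return np.concatenate([feats, pos_maps], axis=0)   # (64, freq, time)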

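Similarly, the Gaussian label representation can be sketched as a soft target over a discretized direction–distance grid, built per object class. Again this is an illustrative assumption, not the thesis code: the bin counts, Gaussian widths, circular azimuth wrapping, and normalization are hypothetical choices; the abstract only states that Gaussian labels encode class-conditioned direction–distance distributions.

import numpy as np

NUM_AZIMUTH_BINS = 72             # assumed: 5-degree azimuth resolution
NUM_DISTANCE_BINS = 40            # assumed: e.g. 1 m bins up to 40 m
SIGMA_AZ, SIGMA_DIST = 2.0, 1.5   # assumed Gaussian widths, in bins

def gaussian_label(az_bin, dist_bin):
    """Soft 2D target centered on the ground-truth (direction, distance) bin."""
    az = np.arange(NUM_AZIMUTH_BINS)
    dist = np.arange(NUM_DISTANCE_BINS)
    # Wrap the azimuth difference so the label is circular around the array.
    d_az = np.minimum(np.abs(az - az_bin), NUM_AZIMUTH_BINS - np.abs(az - az_bin))
    g_az = np.exp(-0.5 * (d_az / SIGMA_AZ) ** 2)
    g_dist = np.exp(-0.5 * ((dist - dist_bin) / SIGMA_DIST) ** 2)
    label = np.outer(g_az, g_dist)               # (azimuth_bins, distance_bins)
    return label / label.sum()                   # normalize to a distribution

# E.g. a vehicle at azimuth bin 10, distance bin 15: the soft target spreads
# probability mass to neighboring bins, which would make training tolerant to
# small errors in the LiDAR pseudo-labels.
target = gaussian_label(az_bin=10, dist_bin=15)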