Autonomous driving relies heavily on cameras and LiDAR for 3D perception, yet these vision-based sensors face limitations under poor illumination, adverse weather, or occlusion. Inspired by human hearing, we explore whether microphone arrays can enhance vehicle perception. We propose SonicVision, the first bird's-eye-view (BEV) acoustic detection framework that jointly localizes and classifies traffic participants using sound alone. Our method employs a horizontally arranged 32-channel microphone array and transforms raw waveforms into short-time Fourier transform (STFT) features augmented with positional embeddings. A ResNet-based architecture is trained with novel Gaussian label representations to predict class-conditioned direction–distance distributions. To support this study, we collect three datasets (simulation, test track, and real road) with synchronized audio and LiDAR, where LiDAR detections serve as pseudo-labels. Experiments show that SonicVision significantly outperforms beamforming-based baselines, achieving accurate localization and reliable classification. In some cases, our approach identifies objects that are missed by LiDAR, suggesting its potential as both an independent sensor and a complementary modality. These results provide the first evidence that low-cost microphone arrays can meaningfully contribute to 3D perception for autonomous vehicles.
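The abstract describes class-conditioned direction–distance Gaussian targets without implementation detail; the sketch below shows one plausible way such targets could be constructed. All bin counts, range limits, and Gaussian widths (az_bins, dist_bins, max_dist, sigma_az, sigma_dist) are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gaussian_bev_label(objects, num_classes, az_bins=360, dist_bins=50,
                       max_dist=50.0, sigma_az=5.0, sigma_dist=2.0):
    """Build a class-conditioned direction-distance Gaussian target map.

    objects: list of (class_id, azimuth_deg, distance_m) tuples from the
    LiDAR pseudo-labels. Returns an array of shape
    (num_classes, az_bins, dist_bins) in which each object contributes a
    2D Gaussian centred on its azimuth/distance bin.
    """
    target = np.zeros((num_classes, az_bins, dist_bins), dtype=np.float32)
    az_grid = np.arange(az_bins)[:, None]        # (az_bins, 1)
    dist_grid = np.arange(dist_bins)[None, :]    # (1, dist_bins)

    for cls, az_deg, dist_m in objects:
        az_c = (az_deg % 360.0) * az_bins / 360.0
        dist_c = np.clip(dist_m, 0.0, max_dist) * (dist_bins - 1) / max_dist
        # Wrap-around azimuth difference so that 359 deg and 1 deg are close.
        d_az = np.minimum(np.abs(az_grid - az_c),
                          az_bins - np.abs(az_grid - az_c))
        d_dist = dist_grid - dist_c
        g = np.exp(-0.5 * ((d_az / sigma_az) ** 2 + (d_dist / sigma_dist) ** 2))
        # Keep the per-cell maximum so nearby objects do not sum above 1.
        target[cls] = np.maximum(target[cls], g)
    return target

# Example: a car (class 0) at 30 deg / 12 m and a pedestrian (class 1) at 300 deg / 8 m.
label = gaussian_bev_label([(0, 30.0, 12.0), (1, 300.0, 8.0)], num_classes=3)
```

A soft target of this form lets the network be supervised with a dense regression loss over the BEV grid rather than sparse one-hot bins, which is one common motivation for Gaussian label smoothing in detection heads.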