Unsupervised Domain Adaptation for Multi-Modal 3D Object Detection under Asymmetric Sensor Degradation
M.D. Yang (TU Delft - Mechanical Engineering)
S. Wang – Mentor (TU Delft - Mechanical Engineering)
J.F.P. Kooij – Mentor (TU Delft - Mechanical Engineering)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Multi-modal 3D object detectors achieve state-of-the-art performance but remain brittle under asymmetric sensor degradation, for example when LiDAR point clouds become sparse in a new environment. In this paper, we investigate unsupervised cross-modal adaptation, in which an unaffected reference modality is used to recover the performance of a degraded one without requiring target-domain labels. Using UniBEV on the nuScenes dataset, we simulate severe degradation by reducing the LiDAR resolution from 32 to 8 beams. We systematically compare two adaptation paradigms, both anchored by the reliable camera stream: output-level camera pseudo-labeling and feature-level cross-modal mapping via a Bird's-Eye-View (BEV) Attention U-Net. Our experiments show that feature mapping aligns coarse spatial structures (improving LiDAR-only mAP by 5.6%) but fails to preserve fine-grained localization metrics. In contrast, simple confidence-filtered pseudo-labeling provides a substantially stronger recovery, yielding a 13.1% mAP improvement. These findings suggest that basic feature-level alignment may be insufficient to restore fine-grained 3D detection under severe spatial degradation, and that direct output-level supervision can be a more effective and reliable strategy for cross-modal adaptation in this regime.
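As an illustration only, the two data-side operations named in the abstract (reducing a 32-beam LiDAR sweep to 8 beams, and confidence-filtering camera predictions into pseudo-labels) might be sketched as below. This is not the authors' implementation. The function names, the assumption that each point carries an integer ring (beam) index, the choice of keeping every fourth beam, and the 0.3 score threshold are all hypothetical.

```python
import numpy as np


def downsample_beams(points, ring, keep_every=4):
    """Simulate a sparser LiDAR by keeping every `keep_every`-th beam.

    points: (N, C) array of LiDAR points.
    ring:   (N,) integer beam index per point (assumed to be available).
    With 32 beams and keep_every=4, 8 beams remain.
    """
    mask = (ring % keep_every) == 0
    return points[mask]


def filter_pseudo_labels(boxes, scores, threshold=0.3):
    """Keep only camera-branch detections with score >= threshold.

    boxes:  (M, D) array of predicted 3D boxes.
    scores: (M,) array of detection confidences.
    The threshold value is an assumption, not taken from the source.
    """
    keep = scores >= threshold
    return boxes[keep], scores[keep]


# Toy usage: one point per beam for 32 beams, and three candidate boxes.
points = np.zeros((32, 4))
ring = np.arange(32)
sparse_points = downsample_beams(points, ring)

boxes = np.arange(21, dtype=float).reshape(3, 7)
scores = np.array([0.9, 0.1, 0.5])
kept_boxes, kept_scores = filter_pseudo_labels(boxes, scores)
```

In this sketch the retained boxes would serve as training targets for the degraded LiDAR branch. How the thesis selects beams and sets the confidence threshold is not stated in the abstract.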