Radar-guided Monocular Depth Estimation and Point Cloud Fusion for 3D Object Detection

Master Thesis (2022)
Author(s)

S. Baratam (TU Delft - Mechanical Engineering)

Contributor(s)

D.M. Gavrila – Mentor (TU Delft - Intelligent Vehicles)

A. Palffy – Graduation committee member (TU Delft - Intelligent Vehicles)

Faculty
Mechanical Engineering
Publication Year
2022
Language
English
Copyright
© 2022 Srimannarayana Baratam
Graduation Date
28-06-2022
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Multi-class road user detection using next-generation 3+1D (range, azimuth, elevation, and Doppler) radars has been shown to be feasible, thanks to the increased density of their point clouds and the inclusion of elevation information. However, object detection networks using (64-layer) LiDAR point clouds still dominate the performance metrics. In this work, we explore the potential of fusing a 3+1D radar point cloud with a monocular image to further close this performance gap in 3D object detection. We propose a generic and modular fusion architecture that extracts both spatial and semantic cues from an RGB image to complement the radar point cloud. In a two-stage approach, we first generate a 3D point cloud representation of the input monocular image, appended with semantic information, through our proposed RAID (RAdar guided Instance-aware Depth) network, which takes as input a monocular depth map and panoptic masks predicted by any pre-trained state-of-the-art networks, together with a radar depth map. We then append the resulting point cloud to the 3+1D radar point cloud in a straightforward fusion scheme and train a point cloud-based object detection network. Results on the View-of-Delft dataset [1] show that our fusion approach significantly outperforms multiple state-of-the-art radar-camera fusion methods (proposed fusion vs. best baseline: 53.6 mAP vs. 50.8 mAP) and yields performance comparable to a network trained on LiDAR input when evaluated in the safety-critical driving corridor (80.5 mAP vs. 81.6 mAP).
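
To illustrate the second stage described above, the sketch below shows one plausible way to turn a predicted depth map and panoptic mask into a semantic pseudo point cloud and concatenate it with the radar points. It is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the per-point feature layout (x, y, z plus one extra channel), and the zero-padding used to align camera and radar features are all hypothetical, and the camera intrinsics K are assumed to be known.

```python
import numpy as np

def depth_to_semantic_points(depth, panoptic, K, max_depth=50.0):
    """Back-project a dense depth map into 3D points in the camera frame
    and attach a per-pixel class id taken from the panoptic mask.
    depth: (H, W) metric depth; panoptic: (H, W) class ids; K: 3x3 intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth > 0) & (depth < max_depth)
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]  # pinhole back-projection
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    cls = panoptic[valid].astype(np.float32)
    # Each pseudo point carries (x, y, z, class_id); the actual feature set
    # used in the thesis may differ.
    return np.stack([x, y, z, cls], axis=1)

def fuse_with_radar(camera_points, radar_points):
    """Naively concatenate camera pseudo points with radar points that are
    already expressed in the same reference frame, e.g. (x, y, z, Doppler).
    Feature channels are zero-padded to a common width (an assumption here)."""
    width = max(camera_points.shape[1], radar_points.shape[1])
    pad = lambda p: np.pad(p, ((0, 0), (0, width - p.shape[1])))
    return np.concatenate([pad(radar_points), pad(camera_points)], axis=0)
```

The fused array could then be fed to any point cloud-based 3D object detector in place of the radar-only input.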

Files

S_Baratam_Thesis.pdf
(pdf | 8.97 MB)
License info not available