Radar-guided Monocular Depth Estimation and Point Cloud Fusion for 3D Object Detection


Abstract

Multi-class road user detection using next-generation, 3+1D (range, azimuth, elevation, and Doppler) radars has been shown to be feasible, thanks to the increased density of their point clouds and the inclusion of elevation information. However, object detection networks using LiDAR (64-layer) point clouds still dominate the performance metrics. In this work, we explore the potential of fusing a 3+1D radar point cloud and a monocular image to further close this performance gap in 3D object detection. We propose a generic and modular fusion architecture that extracts both spatial and semantic cues from an RGB image to complement the radar point cloud. In a two-stage approach, we first generate a 3D point cloud representation of the input monocular image, appended with semantic information, through our proposed RAID (RAdar guided Instance-aware Depth) network, which takes as input a monocular depth map and panoptic masks predicted by any pre-trained state-of-the-art networks, together with a radar depth map. We then append the resulting point cloud to the 3+1D radar point cloud in a straightforward fusion scheme and train a point cloud-based object detection network. Results on the View-of-Delft dataset [1] show that our fusion approach significantly outperforms multiple state-of-the-art radar-camera fusion methods (proposed fusion vs. best baseline: 53.6 mAP vs. 50.8 mAP), and yields comparable performance to a network trained on LiDAR input when evaluated in the safety-critical driving corridor (80.5 mAP vs. 81.6 mAP).
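For illustration, the sketch below shows one possible way a semantically annotated pseudo point cloud could be built from a predicted depth map and panoptic mask and then concatenated with the 3+1D radar points. The function names, the per-point feature layout (Doppler, RCS, semantic class, origin flag), and the padding values are assumptions made for this example only; they are not necessarily the representation used inside the RAID network or the downstream detector.

```python
import numpy as np

def depth_to_pseudo_points(depth, panoptic_classes, K):
    """Back-project a dense depth map to 3D points and attach the per-pixel
    semantic class id as an extra feature channel.

    depth:            (H, W) metric depth in meters
    panoptic_classes: (H, W) integer class id per pixel
    K:                (3, 3) camera intrinsic matrix
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]           # X = (u - cx) * Z / fx
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]           # Y = (v - cy) * Z / fy
    xyz = np.stack([x, y, z], axis=1)
    sem = panoptic_classes.reshape(-1, 1).astype(np.float32)
    valid = z > 0                                         # drop pixels without depth
    return np.concatenate([xyz, sem], axis=1)[valid]      # (N, 4): x, y, z, class

def fuse_with_radar(pseudo_points, radar_points):
    """Concatenate the image-derived pseudo point cloud with the 3+1D radar
    points, padding features so both share one layout (an assumed layout):
    [x, y, z, doppler, rcs, semantic_class, is_radar].

    pseudo_points: (N, 4) from depth_to_pseudo_points
    radar_points:  (M, 5) x, y, z, Doppler, RCS
    """
    n_img, n_rad = len(pseudo_points), len(radar_points)
    # pseudo points carry no Doppler/RCS -> pad with zeros, flag is_radar = 0
    img = np.column_stack([pseudo_points[:, :3],
                           np.zeros((n_img, 2)),
                           pseudo_points[:, 3:4],
                           np.zeros((n_img, 1))])
    # radar points carry no semantic class -> pad with -1, flag is_radar = 1
    rad = np.column_stack([radar_points[:, :5],
                           -np.ones((n_rad, 1)),
                           np.ones((n_rad, 1))])
    return np.concatenate([img, rad], axis=0)
```

The fused array could then be fed to any point cloud-based 3D detector that accepts additional per-point feature channels; the abstract's "straightforward fusion scheme" is interpreted here simply as point-level concatenation under these assumed feature conventions.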