Zero-Shot and Few-Shot Learning for 3D Object Detection Using Language Models


Abstract

This research introduces a novel approach to 3D object detection that leverages language models, with a particular focus on challenges encountered in the autonomous vehicle domain. The primary objective is to address the limitations of object detection models that rely heavily on labeled data, are resource-intensive to train, and show limited ability to recognize new, unseen objects. The thesis presents PointGLIP, a model that integrates zero-shot and few-shot learning and uses language models to augment object detection in 3D point cloud data. PointGLIP builds on GLIP encoders, which are known for their ability to align textual and visual data. The method transfers pre-trained knowledge from the 2D domain to the 3D domain by converting point clouds into depth maps, enabling object detection with little or no prior 3D training. The model’s ability to generalize and to transfer learning from 2D to 3D is evaluated through experiments on the nuScenes dataset. The results show that the model performs poorly in both the zero-shot and the few-shot setting: it struggles to detect objects from depth maps, which indicates that knowledge transfers poorly from the 2D to the 3D context. In 2D, integrating descriptive text through language models offers a distinctive approach to contextual understanding, but the outcomes demonstrate that further refinements are needed before the method is consistently reliable.
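
To make the 2D-to-3D transfer concrete, the sketch below shows one plausible way to render a LiDAR point cloud into a depth map that a 2D grounded detector such as GLIP could then consume. It is a minimal illustration under stated assumptions, not the thesis implementation: the function name point_cloud_to_depth_map, the nuScenes-like camera intrinsics, and the synthetic point cloud are placeholders introduced here for demonstration only.

```python
# Minimal sketch (not the thesis implementation): project a point cloud onto
# the image plane to build a depth map that a 2D detector could consume.
# The intrinsics and the point cloud below are illustrative placeholders.
import numpy as np

def point_cloud_to_depth_map(points_cam, intrinsics, image_size):
    """Render points (already in the camera frame, shape [N, 3]) into a
    depth image of shape (H, W); pixels hit by no point remain 0."""
    h, w = image_size
    depth = np.zeros((h, w), dtype=np.float32)

    # Keep only points in front of the camera.
    pts = points_cam[points_cam[:, 2] > 0.1]

    # Perspective projection: u = fx * x / z + cx, v = fy * y / z + cy.
    z = pts[:, 2]
    u = (intrinsics[0, 0] * pts[:, 0] / z + intrinsics[0, 2]).astype(int)
    v = (intrinsics[1, 1] * pts[:, 1] / z + intrinsics[1, 2]).astype(int)

    # Discard projections that fall outside the image.
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z = u[valid], v[valid], z[valid]

    # Keep the nearest point per pixel (simple z-buffer): write far points
    # first so nearer points overwrite them.
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth

# Example with synthetic data; on nuScenes the points and intrinsics would
# come from a LiDAR sweep and the corresponding calibrated camera.
K = np.array([[1266.4, 0.0, 816.3],
              [0.0, 1266.4, 491.5],
              [0.0, 0.0, 1.0]])
points = np.random.uniform([-20.0, -5.0, 1.0], [20.0, 5.0, 60.0], size=(50_000, 3))
depth_map = point_cloud_to_depth_map(points, K, image_size=(900, 1600))
```

In a full pipeline of this kind, the resulting depth map would be paired with a textual prompt (for example, class names such as "car" or "pedestrian") and passed to the GLIP image and text encoders, so that the 2D pre-trained detector can be applied to 3D data without additional 3D training.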