Bridging the world of 2D and 3D Computer Vision

Self-Supervised Cross-modality Feature Learning through 3D Gaussian Splatting


Abstract

Current robotic perception systems rely on a variety of sensors to estimate and understand a robot's surroundings. This paper focuses on a novel data representation technique that uses a recent scene reconstruction algorithm, known as 3D Gaussian Splatting, to explicitly represent and reason about an environment using only a sparse set of camera views as input. To achieve this, I generate and analyze the first cross-modal dataset consisting of 3D Gaussians and views captured around ten household objects. I feed the resulting 3D Gaussians and images to a self-supervised feature learning network that learns robust 2D and 3D embedding representations by optimizing for the cross-view and cross-modality correspondence pretext tasks. I experiment with several 3D Gaussian features as input to the model and with two point sub-network backbones, and report results on both pretext tasks. The learned features are subsequently fine-tuned for the 2D and 3D shape recognition tasks. Moreover, by leveraging the fast scene reconstruction capabilities of the algorithm, I propose using rendered views as a visual memory aid to support downstream robotic tasks. The proposed networks achieve results comparable to state-of-the-art methods for point and image processing. The code associated with this paper is available at https://github.com/SimiOy/Self-Supervised-Learning-for-3DGS.
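To make the cross-modality setup concrete, the sketch below shows one plausible way to pair per-Gaussian attributes with image embeddings under a contrastive correspondence objective. This is a minimal illustration, not the paper's actual implementation: the feature layout, module names (`gaussian_features`, `CrossModalHeads`, `nt_xent`), and embedding dimensions are assumptions made for the example.

```python
# Hypothetical sketch of a cross-modality correspondence objective between
# 3D Gaussian embeddings and image embeddings. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_features(means, scales, rotations, opacities, sh_dc):
    """Concatenate per-Gaussian attributes into one input vector for a point backbone.
    means: (N, 3), scales: (N, 3), rotations: (N, 4) quaternions,
    opacities: (N, 1), sh_dc: (N, 3) zeroth-order SH colour. Returns (N, 14)."""
    return torch.cat([means, scales, rotations, opacities, sh_dc], dim=-1)


class CrossModalHeads(nn.Module):
    """Projects a point-backbone embedding and an image-backbone embedding
    into a shared space for the cross-modality correspondence pretext task.
    Dimensions are illustrative assumptions."""

    def __init__(self, point_dim=1024, image_dim=512, embed_dim=256):
        super().__init__()
        self.point_head = nn.Linear(point_dim, embed_dim)
        self.image_head = nn.Linear(image_dim, embed_dim)

    def forward(self, point_feat, image_feat):
        z3d = F.normalize(self.point_head(point_feat), dim=-1)
        z2d = F.normalize(self.image_head(image_feat), dim=-1)
        return z3d, z2d


def nt_xent(z3d, z2d, temperature=0.07):
    """InfoNCE-style loss: the i-th Gaussian cloud should match the i-th view."""
    logits = z3d @ z2d.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z3d.size(0), device=z3d.device)   # matching pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this reading, each batch pairs a set of Gaussians from a reconstructed object with one of its views; the loss pulls matching 2D and 3D embeddings together and pushes non-matching pairs apart, which is one common way such correspondence pretext tasks are formulated.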