Bridging the world of 2D and 3D Computer Vision

Self-Supervised Cross-modality Feature Learning through 3D Gaussian Splatting


Abstract

Current robotic perception systems rely on a variety of sensors to estimate and understand a robot's surroundings. This paper focuses on a novel data representation technique that uses a recent scene reconstruction algorithm, known as 3D Gaussian Splatting, to explicitly represent and reason about an environment using only a sparse set of camera views as input. To achieve this, I generate and analyze the first cross-modal dataset consisting of 3D Gaussians and views captured around ten household objects. I feed the resulting 3D Gaussians and images to a self-supervised feature learning network that learns robust 2D and 3D embedding representations by optimizing for the cross-view and cross-modality correspondence pretext tasks. I experiment with several 3D Gaussian features as input to the model and with two point sub-network backbones, and report results on both pretext tasks. The learned features are subsequently fine-tuned for the 2D and 3D shape recognition tasks. Moreover, by leveraging the fast scene reconstruction capabilities of the algorithm, I propose using rendered views as a visual memory aid to support downstream robotic tasks. The proposed networks achieve results comparable to state-of-the-art methods for point and image processing. The code associated with this paper is available at https://github.com/SimiOy/Self-Supervised-Learning-for-3DGS.
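To make the cross-modality setup concrete, the sketch below shows one plausible way to pair per-Gaussian attributes with image embeddings under a contrastive correspondence objective. This is a minimal illustration, not the paper's actual implementation: the feature layout, module names (`gaussian_features`, `CrossModalHeads`, `nt_xent`), and embedding dimensions are assumptions made for the example.

```python
# Hypothetical sketch of a cross-modality correspondence objective between
# 3D Gaussian embeddings and image embeddings. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_features(means, scales, rotations, opacities, sh_dc):
    """Concatenate per-Gaussian attributes into one input vector for a point backbone.
    means: (N, 3), scales: (N, 3), rotations: (N, 4) quaternions,
    opacities: (N, 1), sh_dc: (N, 3) zeroth-order SH colour. Returns (N, 14)."""
    return torch.cat([means, scales, rotations, opacities, sh_dc], dim=-1)


class CrossModalHeads(nn.Module):
    """Projects a point-backbone embedding and an image-backbone embedding
    into a shared space for the cross-modality correspondence pretext task.
    Dimensions are illustrative assumptions."""

    def __init__(self, point_dim=1024, image_dim=512, embed_dim=256):
        super().__init__()
        self.point_head = nn.Linear(point_dim, embed_dim)
        self.image_head = nn.Linear(image_dim, embed_dim)

    def forward(self, point_feat, image_feat):
        z3d = F.normalize(self.point_head(point_feat), dim=-1)
        z2d = F.normalize(self.image_head(image_feat), dim=-1)
        return z3d, z2d


def nt_xent(z3d, z2d, temperature=0.07):
    """InfoNCE-style loss: the i-th Gaussian cloud should match the i-th view."""
    logits = z3d @ z2d.t() / temperature                     # (B, B) similarities
    targets = torch.arange(z3d.size(0), device=z3d.device)   # matching pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this reading, each batch pairs a set of Gaussians from a reconstructed object with one of its views; the loss pulls matching 2D and 3D embeddings together and pushes non-matching pairs apart, which is one common way such correspondence pretext tasks are formulated.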