Making it Clear

Using Vision Transformers in Multi-View Stereo on Specular and Transparent Materials

Master Thesis (2023)
Author(s)

W.E.P. Tolsma (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

N. Tömen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Jan C. van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Pieter Tolsma
Publication Year
2023
Language
English
Graduation Date
11-10-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science
Related content

Code and dataset

https://github.com/pietertolsma/ToteMVS
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Transparency and specularity are challenging phenomena that modern depth perception systems must handle to be usable in practice. A promising family of depth estimation methods is Multi-View Stereo (MVS), which combines multiple RGB images to predict depth and thus circumvents the need for costly specialized hardware. However, finding pixel-to-pixel correspondences between images is a difficult task clouded by ambiguity. To assess how well current methods cope with such ambiguity, we introduce ToteMVS: a multi-view, multi-material synthetic dataset containing diffuse, specular, and transparent objects. Recent work in computer vision has increasingly replaced Convolutional Neural Networks (CNNs) with the emerging Vision Transformer (ViT) architecture, but it remains unclear whether ViTs outperform CNNs in handling reflective and transparent materials. In our study, we use ToteMVS to compare ViT- and CNN-based architectures on their ability to extract useful features for depth estimation on diffuse, specular, and transparent objects. Our results show that, in contrast to the current trend of favouring ViTs over CNNs, the ViT-based model has no special capability for dealing with these challenging materials in the context of MVS. Our evaluation data and related code are available on GitHub: https://github.com/pietertolsma/ToteMVS/.
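As a concrete illustration of the component the comparison varies, the sketch below shows how a CNN and a ViT backbone each turn an RGB view into a 2D feature map before any multi-view matching. It is a minimal sketch, not code from the thesis or the ToteMVS repository; the timm backbones (resnet18, vit_base_patch16_224) and the single-class-token layout are illustrative assumptions.

```python
# Illustrative sketch (not the thesis implementation): CNN vs. ViT backbones
# as 2D feature extractors, the part of an MVS pipeline this study swaps out.
import torch
import timm  # assumption: timm is used for both backbones

images = torch.randn(2, 3, 224, 224)  # two RGB views of the same scene

# CNN-based extractor: features_only returns spatial feature maps per stage.
cnn = timm.create_model("resnet18", pretrained=False, features_only=True)
cnn_feats = cnn(images)[-1]  # deepest feature map, e.g. (2, 512, 7, 7)

# ViT-based extractor: patch tokens are reshaped into a coarse feature map.
vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
tokens = vit.forward_features(images)      # e.g. (2, 197, 768) with class token
patch_tokens = tokens[:, 1:, :]            # drop the class token (assumption: 1 prefix token)
side = int(patch_tokens.shape[1] ** 0.5)   # 14x14 patch grid for 224/16
vit_feats = patch_tokens.transpose(1, 2).reshape(2, -1, side, side)

print(cnn_feats.shape, vit_feats.shape)    # (2, 512, 7, 7) and (2, 768, 14, 14)
```

Either feature map would then be fed to the same cost-volume / depth-regression stages, so any difference on specular and transparent surfaces can be attributed to the backbone.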
