Making it Clear
Using Vision Transformers in Multi-View Stereo on Specular and Transparent Materials
W.E.P. Tolsma (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Tömen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Jan C. van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
Transparency and specularity are challenging phenomena that modern depth perception systems must handle in order to be usable in practice. A promising family of depth estimation methods is Multi-View Stereo (MVS), which combines multiple RGB images to predict depth, circumventing the need for costly specialized hardware. Although promising, finding pixel-to-pixel correspondences between images is a challenging task, clouded by ambiguity. To assess how well current methods deal with such ambiguity, we introduce ToteMVS: a multi-view, multi-material synthetic dataset containing diffuse, specular, and transparent objects. Recent works in computer vision have effectively replaced Convolutional Neural Networks (CNNs) with the emerging Vision Transformer (ViT) architecture, but it remains unclear whether ViTs outperform CNNs in handling reflective and transparent materials. In our study, we use ToteMVS to compare ViT- and CNN-based architectures on their ability to extract useful features for depth estimation on diffuse, specular, and transparent objects. Our results show that, contrary to the current trend of favoring ViTs over CNNs, the ViT-based model offers no particular advantage in handling these challenging materials in the context of MVS. Our evaluation data and related code are available on GitHub: https://github.com/pietertolsma/ToteMVS/.
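
To make the CNN-versus-ViT comparison concrete, the sketch below shows one way to extract dense per-view feature maps from both backbone families in PyTorch, as one would before multi-view matching. It is a minimal illustration under assumed choices: ResNet-18 and ViT-B/16 from torchvision, 224x224 inputs, and a forward hook to grab the ViT's patch tokens. These are not the specific architectures or code used in the thesis; the actual implementation is in the linked GitHub repository.

    # Illustrative sketch only: backbones, input sizes, and the hook-based
    # token extraction are assumptions, not the thesis's actual pipeline.
    import torch
    import torchvision.models as models

    # Three RGB views of the same scene (random stand-in data).
    images = torch.randn(3, 3, 224, 224)

    # CNN backbone: truncate ResNet-18 before pooling/classification
    # to keep a dense spatial feature map per view.
    resnet = models.resnet18(weights=None).eval()
    cnn_backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    # ViT backbone: capture the encoder's output tokens with a forward
    # hook, then fold the patch tokens back into a 2D feature grid.
    vit = models.vit_b_16(weights=None).eval()
    captured = {}
    vit.encoder.register_forward_hook(
        lambda mod, inp, out: captured.update(tokens=out)
    )

    with torch.no_grad():
        cnn_feats = cnn_backbone(images)          # (3, 512, 7, 7)
        vit(images)
    tokens = captured["tokens"]                   # (3, 197, 768), incl. class token
    vit_feats = tokens[:, 1:].transpose(1, 2).reshape(3, 768, 14, 14)

    print(cnn_feats.shape)  # torch.Size([3, 512, 7, 7])
    print(vit_feats.shape)  # torch.Size([3, 768, 14, 14])

Either feature grid can then be fed to the same matching and depth-regression stages, which is what makes a controlled backbone comparison of the kind described above possible.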