Making it Clear

Using Vision Transformers in Multi-View Stereo on Specular and Transparent Materials

Abstract

Transparency and specularity are challenging phenomena that modern depth perception systems must handle to be usable in practice. A promising family of depth estimation methods is Multi-View Stereo (MVS), which combines multiple RGB images to predict depth, circumventing the need for costly specialized hardware. However, MVS relies on finding pixel-to-pixel correspondences between images, a task clouded by ambiguity, especially on specular and transparent surfaces. To assess how well current methods deal with such ambiguity, we introduce ToteMVS: a multi-view, multi-material synthetic dataset containing diffuse, specular, and transparent objects. Recent work in computer vision has increasingly replaced Convolutional Neural Networks (CNNs) with the emerging Vision Transformer (ViT) architecture, but it remains unclear whether ViTs outperform CNNs on reflective and transparent materials. Using ToteMVS, we compare ViT- and CNN-based architectures on their ability to extract features useful for depth estimation on diffuse, specular, and transparent objects. Our results show that, in contrast with the current trend of preferring ViTs over CNNs, the ViT-based model offers no particular advantage in handling these challenging materials in the context of MVS. Our evaluation data and related code are available on GitHub: https://github.com/pietertolsma/ToteMVS/.