ViewFormer

NeRF-Free Neural Rendering from Few Images Using Transformers

Conference Paper (2022)
Author(s)

Jonáš Kulhánek (Czech Technical University)

Erik Derner (Czech Technical University)

Torsten Sattler (Czech Technical University)

Robert Babuska (Czech Technical University, TU Delft - Learning & Autonomous Control)

Research Group
Learning & Autonomous Control
Copyright
© 2022 Jonáš Kulhánek, Erik Derner, Torsten Sattler, R. Babuska
DOI related publication
https://doi.org/10.1007/978-3-031-19784-0_12
Publication Year
2022
Language
English
Bibliographical Note
Green Open Access added to the TU Delft Institutional Repository as part of the Taverne project, 'You share, we take care!' (https://www.openaccess.nl/en/you-share-we-take-care). Otherwise, as indicated in the copyright section: the publisher is the copyright holder of this work, and the author uses Dutch legislation to make the work publicly available.
Pages (from-to)
198-216
ISBN (print)
978-3-031-19783-3
ISBN (electronic)
978-3-031-19784-0
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or any part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Novel view synthesis is a long-standing problem. In this work, we consider a variant of the problem where we are given only a few context views sparsely covering a scene or an object. The goal is to predict novel viewpoints in the scene, which requires learning priors. The current state of the art is based on Neural Radiance Fields (NeRF), and while these methods achieve impressive results, they suffer from long training times, as they require evaluating millions of 3D point samples via a neural network for each image. We propose a 2D-only method that maps multiple context views and a query pose to a new image in a single pass of a neural network. Our model uses a two-stage architecture consisting of a codebook and a transformer model. The codebook embeds individual images into a smaller latent space, and the transformer solves the view synthesis task in this more compact space. To train our model efficiently, we introduce a novel branching attention mechanism that allows us to use the same model not only for neural rendering but also for camera pose estimation. Experimental results on real-world scenes show that our approach is competitive with NeRF-based methods while not reasoning explicitly in 3D, and it is faster to train.
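The two-stage design described in the abstract (a codebook that tokenizes images, followed by a transformer operating on those tokens) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: the image resolution, token grid size, pose encoding (position plus quaternion), and all layer sizes are assumptions, and the paper's branching attention mechanism, which additionally yields camera pose estimates, is omitted for brevity.

```python
# Hypothetical sketch of the two-stage idea from the abstract; NOT the
# authors' code. Stage 1 quantizes each image into a small grid of
# discrete codes; stage 2 is a transformer that maps context-view codes
# plus camera poses to the codes of the query view.
import torch
import torch.nn as nn

class Codebook(nn.Module):
    """Stage 1: VQ-style encoder mapping a 128x128 image to 8x8 code indices."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),  # 128 -> 32
            nn.Conv2d(64, dim, kernel_size=4, stride=4),           # 32 -> 8
        )
        self.codes = nn.Embedding(num_codes, dim)

    def encode(self, images):                      # images: (B, 3, 128, 128)
        feats = self.encoder(images)               # (B, dim, 8, 8)
        flat = feats.flatten(2).transpose(1, 2)    # (B, 64, dim)
        # nearest-codebook-entry lookup per spatial location
        book = self.codes.weight.unsqueeze(0).expand(flat.size(0), -1, -1)
        return torch.cdist(flat, book).argmin(-1)  # (B, 64) token indices

class ViewTransformer(nn.Module):
    """Stage 2: predict the query view's tokens from context tokens + poses."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pose = nn.Linear(7, dim)              # assumed: position + quaternion
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, num_codes)      # logits over the codebook

    def forward(self, ctx_tokens, ctx_poses, query_pose):
        # ctx_tokens: (B, V, 64), ctx_poses: (B, V, 7), query_pose: (B, 7)
        B, V, T = ctx_tokens.shape
        x = self.tok(ctx_tokens) + self.pose(ctx_poses).unsqueeze(2)  # pose per view
        x = x.reshape(B, V * T, -1)
        # query slots carry only the query pose; the transformer fills them in
        q = self.pose(query_pose).unsqueeze(1).expand(B, T, -1)
        out = self.backbone(torch.cat([x, q], dim=1))
        return self.head(out[:, -T:])              # (B, 64, num_codes)
```

Rendering a novel view then amounts to decoding the predicted query-view codes (via the codebook's decoder, not shown) back into an image. Because the transformer operates on a small token grid in a single forward pass, rather than evaluating millions of 3D point samples as in NeRF-style volume rendering, training and inference are comparatively fast, which is the speed advantage the abstract refers to.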

Files

978_3_031_19784_0_12.pdf
(pdf | 2.66 MB)
- Embargo expired on 01-07-2023
License info not available