Generative RGB-D Face Completion for Head-Mounted Display Removal

Abstract

Virtual reality (VR) creates an exceptional experience in which users can explore virtual environments. Wearing a head-mounted display (HMD), users are able to observe a virtual world that is rendered based on their physical movement and actions. A common solution for capturing the visual and geometric information needed for the construction of virtual environments is the use of RGB-D sensors. These sensors not only capture RGB color data like conventional cameras, but additionally record a depth value for each pixel. RGB-D sensors are thus able to capture both the visual and geometric properties of a space, including any objects or people in it. This makes immersive social VR experiences possible, in which people in different physical locations can be placed in the same virtual environment. However, HMDs obstruct the RGB-D sensor from capturing the wearer's upper face, which severely impacts the social aspects of VR applications.

To address this, we proposed a framework for the virtual removal of head-mounted displays from RGB-D images, which we refer to as the task of HMD removal. Due to its novelty, we took an exploratory approach to this task. We formulated the problem as a joint RGB-D face image inpainting task and proposed a GAN-based coarse-to-fine architecture that simultaneously fills in the missing color and depth information of face images occluded by an HMD. To preserve the identity features of the inpainted faces, we proposed an RGB-based identity loss function. By leveraging the knowledge of a pretrained identity embedding model, this perceptual loss function encourages the preservation of identity-specific facial features.

Furthermore, we proposed several architectural structures to explore multimodal feature fusion of the color and depth information contained in RGB-D images. To this end, we introduced data-level fusion, which naively combines the color and depth information at the network input. In addition, we introduced hybrid fusion, which involves feature-level fusion in the coarse stage and data-level fusion in the refinement stage of the architecture. Within the concept of hybrid fusion, we investigated several fusion strategies, including residual fusion. Our findings suggest that data-level fusion achieves similar performance to hybrid fusion.
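As a rough illustration of the RGB-based identity loss, the sketch below assumes a frozen, pretrained identity embedding network (a hypothetical `embed_net`) and penalizes the cosine distance between the embeddings of the inpainted face and the ground-truth face; the exact embedding model and distance used in the thesis may differ.

```python
import torch
import torch.nn.functional as F

def identity_loss(inpainted_rgb, target_rgb, embed_net):
    """Hypothetical RGB-based identity loss.

    Penalizes the cosine distance between identity embeddings of the inpainted
    face and the ground-truth face. `embed_net` stands in for a frozen,
    pretrained identity embedding model (an assumption for illustration).
    """
    with torch.no_grad():                    # no gradients through the reference face
        target_emb = embed_net(target_rgb)
    pred_emb = embed_net(inpainted_rgb)      # gradients flow back to the generator
    return (1.0 - F.cosine_similarity(pred_emb, target_emb, dim=1)).mean()
```

Data-level fusion, by contrast, amounts to concatenating the color and depth channels into a single four-channel tensor at the generator input.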

Moreover, to improve surface reproduction in the depth channel, we introduced a surface normal loss function and a contextual surface attention module, both of which rely on surface normals estimated from the depth channel of the RGB-D image. We also considered adding surface normal information to the discriminator input, which we found to have an adverse effect on the visual quality of the results.
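The surface normal loss and the contextual surface attention module both operate on normals derived from the depth channel. A minimal sketch of such a derivation, assuming depth maps shaped (B, 1, H, W) and simple finite differences (the thesis may use a different estimation scheme and distance), is:

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth):
    """Estimate per-pixel surface normals from a depth map of shape (B, 1, H, W)
    via finite differences; a simplified stand-in for the estimator in the thesis."""
    dzdx = F.pad(depth[..., :, 1:] - depth[..., :, :-1], (0, 1, 0, 0))  # gradient along x
    dzdy = F.pad(depth[..., 1:, :] - depth[..., :-1, :], (0, 0, 0, 1))  # gradient along y
    normals = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)  # (-dz/dx, -dz/dy, 1)
    return F.normalize(normals, dim=1)                                  # unit-length normals

def surface_normal_loss(pred_depth, target_depth):
    """L1 distance between normals of predicted and ground-truth depth
    (an assumed formulation for illustration)."""
    return F.l1_loss(normals_from_depth(pred_depth), normals_from_depth(target_depth))
```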

In the absence of a large-scale RGB-D face dataset, we devised a pipeline for creating a synthetic RGB-D face dataset to evaluate our network. Despite its exploratory nature, our research provides unique insights into the design and behavior of a multimodal image inpainting architecture that may be of interest to future research.