Geometry-Guided Video Generation with Diffusion Feature Textures

Master's Thesis (2025)
Author(s)

J. Romeu Huidobro (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Petr Kellnhofer – Mentor (TU Delft - Computer Graphics and Visualisation)

L. Uzolas – Mentor (TU Delft - Computer Graphics and Visualisation)

Xucong Zhang – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Ricardo Guerra Marroquim – Graduation committee member (TU Delft - Computer Graphics and Visualisation)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
04-07-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Recent advances in generative AI have enabled high-quality video generation from text prompts. However, most existing approaches rely exclusively on prompts, making it difficult for an artist to control the layout and motion of the generated scene. In this thesis, we propose a novel method for geometry-guided text-to-video generation. Our method takes as input an animated mesh sequence and a text prompt, and generates a video that follows both the text prompt and the input geometry. Our pipeline consists of two main stages. First, we use an existing text-driven texture generation method to create an initial rough texture for the geometry. Next, a depth-conditioned text-to-image (T2I) model generates video frames following the guidance animation, using the generated texture to enforce temporal consistency across frames. By generating video frames rather than directly using the result of the texture generation, our method supports deformations of the guidance geometry and variable lighting; by using the texture for feature alignment, it achieves significantly stronger robustness to occlusions and camera motion than existing controllable video-generation approaches. We begin by identifying the failure modes of existing methods through a set of initial experiments; we then use these findings to propose our method, and finally evaluate it through a series of comparisons and ablations.
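
To make the second stage concrete, the sketch below shows per-frame, depth-conditioned generation with Hugging Face diffusers. It is a minimal illustration, not the thesis pipeline: the model identifiers, the depth_renders/ input directory, and the fixed seed are assumptions, and it deliberately omits both the first stage (texture generation) and the texture-based feature alignment that the thesis introduces for temporal consistency.

```python
# Minimal sketch of stage two only: generating one frame per depth map
# rendered from the animated guidance mesh. NOT the thesis implementation;
# model names, input paths, and the fixed seed are illustrative assumptions.
from pathlib import Path

import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth maps rendered from the animated mesh, one per frame
# (assumed to have been exported beforehand, e.g. from Blender).
depth_maps = [
    Image.open(p).convert("RGB")
    for p in sorted(Path("depth_renders").glob("*.png"))
]

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a weathered bronze statue walking through a sunlit courtyard"

frames = []
for depth in depth_maps:
    # Re-seeding every frame is a crude stand-in for real temporal
    # consistency; without texture-based feature alignment the output
    # still flickers between frames.
    generator = torch.Generator(device="cuda").manual_seed(0)
    frame = pipe(
        prompt,
        image=depth,
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    frames.append(frame)

for i, frame in enumerate(frames):
    frame.save(f"frame_{i:04d}.png")
```

In the thesis itself, the rough texture produced in the first stage is used to align features across frames, which is what provides the reported robustness to occlusions and camera motion that this naive per-frame loop lacks.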

Files

Thesis-jorgeromeu.pdf
(pdf | 129 MB)
License info not available