Recent advances in generative AI have enabled high-quality video generation from text prompts. However, most existing approaches rely exclusively on prompts, making it difficult for an artist to control the layout and motion of the generated scene. In this thesis, we propose a novel method for geometry-guided text-to-video generation. Our method takes as input an animated mesh sequence and a text prompt and generates a video that follows both the prompt and the input geometry. Our pipeline consists of two main stages. First, we use an existing text-driven texture generation method to create an initial rough texture for the geometry. Second, a depth-conditioned text-to-image (T2I) model generates video frames following the guidance animation, using the generated texture to enforce temporal consistency across frames. By generating video frames rather than directly rendering the result of the texture generation, our method supports deformations away from the guidance geometry as well as variable lighting; by using the texture for feature alignment, it achieves significantly stronger robustness to occlusions and camera motion than existing controllable video-generation approaches. We begin by identifying the failure modes of existing methods through a set of initial experiments, we then use these findings to design our method, and finally we evaluate it through a series of comparisons and ablations.
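
As a high-level illustration of the two-stage pipeline described above, the following Python sketch shows how a shared texture, generated once from the prompt, could be sampled through per-frame UV maps and fed to a depth-conditioned T2I step. The component names (generate_initial_texture, align_features_to_texture, depth_conditioned_t2i) and the data layout are hypothetical placeholders assumed for illustration, not the thesis implementation.

    # Illustrative sketch only: all three model calls are stubs standing in for
    # the texture-generation and depth-conditioned T2I components described above.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MeshFrame:
        depth: np.ndarray  # rendered depth map of the guidance geometry (H, W)
        uv: np.ndarray     # per-pixel UV coordinates into the shared texture (H, W, 2), in [0, 1]

    def generate_initial_texture(prompt: str, size: int = 512) -> np.ndarray:
        # Stage 1 placeholder: an existing text-driven texture generation method
        # would produce a rough texture for the input geometry here.
        return np.zeros((size, size, 3))

    def align_features_to_texture(texture: np.ndarray, uv: np.ndarray) -> np.ndarray:
        # Sample the shared texture through the frame's UV map so every frame
        # starts from the same appearance, enforcing temporal consistency.
        idx = (uv * (texture.shape[0] - 1)).astype(int)
        return texture[idx[..., 1], idx[..., 0]]  # rows indexed by v, columns by u

    def depth_conditioned_t2i(prompt: str, depth: np.ndarray, aligned: np.ndarray) -> np.ndarray:
        # Stage 2 placeholder: a depth-conditioned T2I model would synthesise the
        # frame here, allowing deformations and lighting changes beyond the guide.
        return aligned

    def generate_video(prompt: str, mesh_sequence: list[MeshFrame]) -> list[np.ndarray]:
        texture = generate_initial_texture(prompt)  # stage 1: rough texture
        return [depth_conditioned_t2i(prompt, f.depth,  # stage 2: per-frame synthesis
                                      align_features_to_texture(texture, f.uv))
                for f in mesh_sequence]

The sketch is only meant to make the data flow concrete: the texture is generated once, while the depth map and UV lookup vary per frame, which is why the output can follow the animation while remaining consistent in appearance.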