Generating arbitrarily long, temporally consistent, and visually realistic street-view videos is a formidable challenge at the intersection of computer vision and graphics. Existing methods, while capable of producing high-quality individual frames or short sequences, often struggle to maintain coherence and realism over extended durations. Parallel generation approaches, though effective at enforcing frame-to-frame consistency, struggle to generate videos of arbitrary length and to adapt pre-trained text-to-image models to the video domain. In response to these challenges, we propose a sequential approach that draws inspiration from recent advances in 3D-aware image generation and object reconstruction. Our model, built on MagicDrive, introduces conditioning mechanisms that leverage the visual context of previous frames. By establishing strong temporal dependencies and smooth transitions between frames, our approach produces coherent and controllable video sequences of any desired length. Through extensive experiments on the challenging nuScenes dataset, we demonstrate the effectiveness of our sequential generation framework. Our model achieves performance competitive with existing parallel methods while offering greater flexibility and computational efficiency. The ability to generate arbitrarily long videos opens up new possibilities for applications such as autonomous vehicle simulation, virtual reality training, and urban planning, where realistic and diverse visual data is crucial. Furthermore, our work contributes to a broader understanding of the trade-offs between sequential and parallel generation paradigms, highlighting the potential of sequential approaches to address the limitations of current methods.
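To make the sequential paradigm concrete, the sketch below shows a minimal autoregressive generation loop in which each new frame is conditioned on a small window of previously generated frames. It is illustrative only: the generator interface, the names (generate_video, scene_conditions, context_length), and the form of the per-frame conditions are assumptions, not the actual MagicDrive-based implementation described in this work.

```python
from typing import List

import torch


def generate_video(
    generator: torch.nn.Module,
    scene_conditions: List[torch.Tensor],
    num_frames: int,
    context_length: int = 2,
) -> List[torch.Tensor]:
    """Generate an arbitrarily long video one frame at a time,
    conditioning each frame on the most recent generated frames."""
    frames: List[torch.Tensor] = []
    for t in range(num_frames):
        # Visual context: the last few generated frames (empty for the first frame).
        context = frames[-context_length:]
        # Hypothetical generator call: each frame is produced from a per-frame
        # scene condition (e.g. layout, camera pose, text prompt) plus the context.
        frame = generator(condition=scene_conditions[t], context=context)
        frames.append(frame)
    return frames
```

Because the loop only ever looks at a fixed-size window of recent frames, memory and compute per step stay constant, which is what allows the sequence length to be chosen freely at inference time.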