World Models: Foundations, Applications, and Limitations
Deep Learning Techniques for Sequential Decision Making
C. Meo (TU Delft - Signal Processing Systems)
J.H.G. Dauwels – Promotor (TU Delft - Signal Processing Systems)
G.J.T. Leus – Promotor (TU Delft - Signal Processing Systems)
Abstract
The ability to predict and model the future is a cornerstone of intelligence, underpinning decision-making and adaptation in dynamic environments. Building intelligent machines carries the potential to advance automation and raise living standards around the world. Recent advances in deep learning have enabled algorithms to make accurate predictions when large datasets of examples are available, facilitating the classification and generation of images and text. Despite this remarkable progress, algorithms still struggle when few examples are available, such as when controlling robots or encountering unforeseen situations. Unlike most learning algorithms, humans quickly adapt to unseen scenarios and learn new skills from relatively small amounts of experience. This ability stems, in part, from internal models of the world, which allow humans to imagine the future outcomes of potential actions. Teaching machines to learn world models accurate enough for successful planning has been challenging, especially when dealing with large unstructured inputs such as videos.
This dissertation addresses the challenge of building artificial intelligence systems that can anticipate future states, adapt to novel scenarios, and make robust decisions with limited data by investigating autoregressive deep state-space models for video prediction and world modeling. At its core, the dissertation focuses on two interconnected tasks: accurately forecasting future frames in video prediction settings, where pixel-level fidelity and an understanding of motion, object interactions, and physical rules are crucial, and the broader problem of modeling an environment’s dynamics through the causal relationships between actions and latent states. Even small inaccuracies in predicted frames can compound over extended sequences, creating significant deviations from reality.
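The compounding of small prediction errors in autoregressive rollouts can be illustrated with a toy one-dimensional example. The coefficients below are placeholders chosen for illustration, not models from the dissertation: the true system shrinks the state by 1% per step, while the learned model overestimates the coefficient by 2%.

```python
# Illustrative sketch (assumed numbers): a true dynamic s' = 0.99 * s,
# approximated by a slightly biased learned model s' = 1.01 * s.
true_coef, model_coef = 0.99, 1.01

s_true = s_pred = 1.0
errors = []
for t in range(50):
    s_true = true_coef * s_true
    s_pred = model_coef * s_pred   # autoregressive: the model consumes its own output
    errors.append(abs(s_pred - s_true))

# A 2% per-step modeling error compounds multiplicatively over the rollout,
# so the final deviation dwarfs the initial one.
print(errors[0], errors[-1])
```

Because each predicted state is fed back as the next input, the gap between imagined and real trajectories grows monotonically, which is why long-horizon video prediction is so sensitive to per-frame accuracy.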
Through the lens of model-based reinforcement learning, this dissertation demonstrates that internal rollouts generated by a learned world model can guide action selection, dramatically reducing the number of actual environment interactions needed to reach competent or even expert-level performance. This property is particularly valuable for robotics and other high-stakes domains, where data acquisition can be slow, expensive, or dangerous. To improve the learning behaviour of video prediction and world models, this dissertation presents several inductive biases: objective functions that encourage temporal consistency between frames or that help model extreme events, a masked generative prior that improves the sequence modeling capabilities of the dynamics modules, disentangled representations that improve exploration strategies, physics-informed approaches that incorporate physical constraints, and attention-based workspaces that enhance multi-agent coordination.
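The idea of guiding action selection with imagined rollouts can be sketched as a minimal random-shooting planner. Everything here is an assumed stand-in: `dynamics` replaces a trained world model, `reward` is an arbitrary illustrative objective, and the hyperparameters are not the dissertation's; the point is only that candidate action sequences are evaluated inside the model, without touching the real environment.

```python
import numpy as np

# Hypothetical learned dynamics: a fixed linear rule standing in for
# a trained network f(s, a) -> s'.
def dynamics(state, action):
    return 0.9 * state + action

# Illustrative reward: stay close to a target state.
def reward(state, target=1.0):
    return -abs(state - target)

def plan(state, horizon=5, n_candidates=64, rng=None):
    """Random-shooting planner: sample action sequences, roll each out
    in the learned model ("imagination"), return the best first action."""
    rng = rng or np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_action, best_return = None, -np.inf
    for seq in candidates:
        s, total = state, 0.0
        for a in seq:
            s = dynamics(s, a)      # imagined transition, no env interaction
            total += reward(s)      # score the imagined outcome
        if total > best_return:
            best_return, best_action = total, seq[0]
    return best_action
```

Only the single chosen action would be executed in the real environment before replanning, which is how imagined rollouts substitute for most of the environment interactions that a model-free learner would need.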
Although the proposed methods yield performance gains in various experimental setups, the real value of this dissertation lies in the versatility of the proposed inductive biases. Built and evaluated across different domains, these biases are designed with the potential for application in large-scale architectures, suggesting that the same algorithmic principles can be repurposed for vision-driven control tasks or for anticipating rare climate events with potentially large societal impacts. These findings span a variety of application domains, from simple simulated environments to real-world tasks, illustrating how breakthroughs in generative modeling and self-supervised learning can be systematically harnessed to tackle the complexity of dynamic scenes and interactive decision-making. Improved data efficiency also reduces the environmental footprint of large-scale training regimes.
In conclusion, this dissertation answers research questions that highlight the transformative potential of generative and latent modeling frameworks to reshape how machines perceive, learn about, and ultimately act within the environments they encounter. By bridging latent imagination, generative representations, and self-supervised objectives, this work reveals a path toward artificial systems that not only learn rapidly from experience but also exhibit interpretability and generalization capabilities, bringing us closer to intelligent agents capable of robust, forward-looking reasoning and collaboration.