C. Meo | TU Delft Repository

World Models: Foundations, Applications, and Limitations

Deep Learning Techniques for Sequential Decision Making

Doctoral thesis (2026) - C. Meo, J.H.G. Dauwels, G.J.T. Leus

The ability to predict and model the future is a cornerstone of intelligence, underpinning decision-making and adaptation in dynamic environments. Building intelligent machines carries the potential to advance automation and increase living standards around the world. Recent advancements in deep learning have enabled algorithms to make accurate predictions when large datasets of examples are available, facilitating classification and generation of images and text. Despite this remarkable progress, algorithms still struggle when few examples are available, such as for controlling robots or encountering unforeseen situations. Unlike most learning algorithms, humans quickly adapt to unseen scenarios and learn new skills from relatively small amounts of experience. This ability stems, in part, from internal models of the world, which allow humans to imagine future outcomes of potential actions. Teaching machines to learn world models accurate enough for successful planning has been challenging, especially when dealing with large unstructured inputs such as videos.

This dissertation addresses the challenge of building artificial intelligence systems that can anticipate future states, adapt to novel scenarios, and make robust decisions with limited data. It does so by investigating autoregressive deep state-space models for video prediction and world modeling. At its core, the dissertation focuses on two interconnected tasks. The first is accurately forecasting future frames in video prediction settings, where pixel-level fidelity and an understanding of motion, object interactions, and physical rules are crucial. The second is the broader problem of modeling an environment’s dynamics through the causal relationships between actions and latent states. Even small inaccuracies in predicted frames can compound over extended sequences, creating significant deviations from reality.
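The compounding effect mentioned above can be illustrated with a toy example. The sketch below is not taken from the dissertation: `true_step` and `model_step` are hypothetical one-dimensional dynamics, and the 1% gain error is an assumed value chosen only to show how a small per-step inaccuracy grows when a model is fed its own predictions.

```python
def rollout(step, x0, horizon):
    """Autoregressively apply `step` to its own output for `horizon` steps."""
    xs = [x0]
    for _ in range(horizon):
        xs.append(step(xs[-1]))
    return xs


def true_step(x):
    # Hypothetical ground-truth dynamics (assumption for illustration).
    return 1.00 * x + 0.1


def model_step(x):
    # Hypothetical learned model with a 1% error in the gain (assumption).
    return 1.01 * x + 0.1


truth = rollout(true_step, 1.0, 50)
pred = rollout(model_step, 1.0, 50)

# Absolute prediction error at each step of the rollout.
errors = [abs(p - t) for p, t in zip(pred, truth)]

for t in (1, 10, 50):
    print(f"step {t:2d}: error = {errors[t]:.3f}")
```

Because each prediction becomes the next input, the error after 50 steps is far larger than 50 times the single-step error.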

Through the lens of model-based reinforcement learning, this dissertation demonstrates that internal rollouts generated by a learned world model can guide action selection, dramatically reducing the number of actual environment interactions needed to reach competent or even expert-level performance. This property is particularly valuable for robotics and other high-stakes domains, where data acquisition can be slow, expensive, or dangerous. To improve the learning behaviour of video prediction and world models, this dissertation presents several inductive biases: objective functions that encourage time consistency between frames or that help model extreme events, a masked generative prior that improves the sequence modelling capabilities of the dynamics modules, disentangled representations that improve exploration strategies, physics-informed approaches that incorporate physical constraints, and attention-based workspaces that enhance multi-agent coordination.
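As a rough illustration of how imagined rollouts can guide action selection, the sketch below plans by exhaustive search inside a model and then executes only the first action of the best plan. It is a minimal sketch under stated assumptions, not the dissertation's method: `world_model` is a hand-written stand-in for a learned model, and `GOAL`, `ACTIONS`, and the planning horizon are hypothetical choices.

```python
from itertools import product

GOAL = 5                # hypothetical target state
ACTIONS = (-1, 0, 1)    # hypothetical discrete action set


def world_model(state, action):
    """Stand-in for a learned model: predicts the next state and reward."""
    next_state = state + action
    reward = -abs(next_state - GOAL)
    return next_state, reward


def imagined_return(state, plan):
    """Sum of predicted rewards along a plan, computed purely inside the model."""
    total = 0.0
    for action in plan:
        state, reward = world_model(state, action)
        total += reward
    return total


def plan_action(state, horizon=3):
    """Score every action sequence in imagination and return the first action of the best one."""
    best_plan = max(
        product(ACTIONS, repeat=horizon),
        key=lambda plan: imagined_return(state, plan),
    )
    return best_plan[0]


# Only these steps would touch the real environment; all search happens in the model.
state = 0
real_interactions = 0
while state != GOAL and real_interactions < 20:
    state += plan_action(state)
    real_interactions += 1

print(state, real_interactions)
```

Each real step here is preceded by 27 imagined three-step rollouts, which is the sense in which model queries substitute for environment interactions. Agents that learn a policy inside the model, rather than searching at decision time, follow the same principle.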

Although the proposed methods yield performance gains in various experimental setups, the real value of this dissertation lies in the versatility of the proposed inductive biases. These biases, built and evaluated across different domains, are designed with the potential for application in large-scale architectures, suggesting that the same algorithmic principles can be repurposed for vision-driven control tasks or for anticipating rare climate events with potentially large societal impacts. These findings bridge a variety of application domains, from simple simulated environments to real-world tasks, illustrating how breakthroughs in generative modeling and self-supervised learning can be systematically harnessed to tackle the complexity of dynamic scenes and interactive decision-making. Improved data efficiency also reduces the environmental footprint of large-scale training regimes.

In conclusion, this dissertation answers research questions that highlight the transformative potential of generative and latent modeling frameworks to reshape how machines perceive, learn about, and ultimately act within the environments they encounter. By bridging latent imagination, generative representations, and self-supervised objectives, this work reveals a path toward artificial systems that not only learn rapidly from experience but also exhibit interpretability and generalization capabilities, bringing us closer to intelligent agents capable of robust, forward-looking reasoning and collaboration.

Precipitation Nowcasting Using Physics Informed Discriminator Generative Models

Conference paper (2024) - Junzhe Yin, Cristian Meo, Ankush Roy, Zeineh Bou Cher, Mircea Lică, Yanbo Wang, Ruben Imhoff, Remko Uijlenhoet, Justin Dauwels

Nowcasting leverages real-time atmospheric conditions to forecast weather over short periods. State-of-the-art models, including PySTEPS, encounter difficulties in accurately forecasting extreme weather events because of their unpredictable distribution patterns. In this study, we design a physics-informed neural network to perform precipitation nowcasting using the precipitation and meteorological data from the Royal Netherlands Meteorological Institute (KNMI). This model draws inspiration from the novel Physics-Informed Discriminator GAN (PID-GAN) formulation, directly integrating physics-based supervision within the adversarial learning framework. The proposed model adopts a GAN structure, featuring a Vector Quantization Generative Adversarial Network (VQ-GAN) and a Transformer as the generator, with a temporal discriminator serving as the discriminator. Our findings demonstrate that the PID-GAN model outperforms numerical and SOTA deep generative models in terms of precipitation nowcasting downstream metrics. ...

Nowcasting of Extreme Precipitation Using Deep Generative Models

Conference paper (2023) - Haoran Bi, Maksym Kyryliuk, Zhiyi Wang, Cristian Meo, Yanbo Wang, Ruben Imhoff, Remko Uijlenhoet, Justin Dauwels

Nowcasting is an observation-based method that uses the current state of the atmosphere to forecast future weather conditions over several hours. Recent studies have shown the promising potential of using deep learning models for precipitation nowcasting. In this paper, novel deep generative models are proposed for precipitation nowcasting. These models are equipped with extreme-value losses to more reliably predict extreme precipitation events. The proposed deep generative model contains a Vector Quantization Generative Adversarial Network and a Transformer ("VQGAN + Transformer"). For enhanced modeling and forecasting of extreme events, an Extreme Value Loss (EVL) is incorporated in the autoregressive Transformer. The numerical results show that the proposed model achieves performance comparable to that of the state-of-the-art conventional nowcasting method PySTEPS for predicting nominal values. By incorporating an EVL, the proposed model yields more accurate nowcasting of extreme precipitation. ...
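The abstract does not spell out the loss, so the following is only a sketch of one published form of an extreme value loss (Ding et al., "Modeling Extreme Events in Time Series Prediction", 2019), written for a single prediction. Whether the paper uses exactly this form inside the Transformer is an assumption. The symbols are also assumptions for illustration: `u` is the predicted probability of an extreme event, `v` is the 0/1 label, `beta0` and `beta1` are the proportions of normal and extreme samples, and `gamma` is a shape hyperparameter.

```python
import math


def extreme_value_loss(u, v, beta0, beta1, gamma=2.0, eps=1e-12):
    """Reweighted binary cross-entropy for one sample.

    u: predicted probability that the event is extreme.
    v: 1 if the event is extreme, else 0.
    beta0, beta1: proportions of normal and extreme samples in the data.
    """
    u = min(max(u, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    # Missed extreme events are weighted by the (large) share of normal samples.
    extreme_term = -beta0 * (1.0 - u / gamma) ** gamma * v * math.log(u)
    # False alarms are weighted by the (small) share of extreme samples.
    normal_term = (
        -beta1 * (1.0 - (1.0 - u) / gamma) ** gamma * (1 - v) * math.log(1.0 - u)
    )
    return extreme_term + normal_term


# Hypothetical class proportions: 95% normal, 5% extreme.
missed_extreme = extreme_value_loss(u=0.1, v=1, beta0=0.95, beta1=0.05)
false_alarm = extreme_value_loss(u=0.9, v=0, beta0=0.95, beta1=0.05)
print(missed_extreme > false_alarm)
```

With these proportions, confidently missing an extreme event is penalised far more than raising an equally confident false alarm, which counteracts the tendency of a standard loss to ignore rare events.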

Image Search Engine by Deep Neural Networks

Conference paper (2022) - Y. Yao, Q. Zhang, Y. Hu, C. Meo, Y. Wang, Andrea Nanetti, J.H.G. Dauwels

We typically search for images by keywords, e.g., when looking for images of apples, we would enter the word “apple” as the query. However, this approach has limitations. For example, if users enter keywords in a specific language, they may miss results labeled in other languages. Moreover, users may have an image of an object they want more information about, e.g., a landmark, without knowing its name. In such a scenario, word-based search is not adequate, while image-based search would be ideally suited. These needs drive us to develop a purely content-based image search engine, meaning that users can search for images with an image as the query. Motivated by this use case with numerous applications, in this paper we propose and validate an image query based search engine... ...
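One common way to realise image-as-query search is to compare feature vectors rather than keywords. The sketch below assumes each image has already been mapped to an embedding by some neural network. The three-dimensional vectors, the file names in `INDEX`, and the use of cosine similarity are illustrative assumptions, not details from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=2):
    """Return the names of the top_k indexed images most similar to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical precomputed embeddings; a real system would obtain them from a network.
INDEX = {
    "apple_01.jpg": [0.9, 0.1, 0.0],
    "apple_02.jpg": [0.8, 0.2, 0.1],
    "tower_01.jpg": [0.0, 0.2, 0.9],
}

results = search([1.0, 0.1, 0.0], INDEX)
print(results)
```

Because the comparison uses only image content, the query needs no label, and the language of any captions in the index does not matter.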

Extreme Precipitation Nowcasting using Deep Generative Models

Conference paper (2022) - H. Bi, M.S. Kyryliuk, Z. Wang, C. Meo, Y. Wang, Ruben Imhoff, R. Uijlenhoet, J.H.G. Dauwels

Extreme precipitation often has substantial impacts. The floods in the Netherlands, Belgium, and Germany in the summer of 2021 caused loss of life, destruction of infrastructure, and long-term economic damage. To avoid such disasters, it is important to develop a reliable and accurate method for predicting heavy rain. ...