C. Meo
World Models: Foundations, Applications, and Limitations
Deep Learning Techniques for Sequential Decision Making
This dissertation addresses the challenge of building artificial intelligence systems that can anticipate future states, adapt to novel scenarios, and make robust decisions with limited data. It does so by investigating autoregressive deep state-space models for video prediction and world modeling. At its core, the dissertation focuses on two interconnected tasks: accurately forecasting future frames in video prediction settings, where pixel-level fidelity and an understanding of motion, object interactions, and physical rules are crucial, and the broader problem of modeling an environment’s dynamics through the causal relationships between actions and latent states. Even small inaccuracies in predicted frames can compound over extended sequences, creating significant deviations from reality.
Through the lens of model-based reinforcement learning, this dissertation demonstrates that internal rollouts generated by a learned world model can guide action selection, dramatically reducing the number of actual environment interactions needed to reach competent or even expert-level performance. This property is particularly valuable for robotics and other high-stakes domains, where data acquisition can be slow, expensive, or dangerous. To improve the learning behavior of video prediction and world models, this dissertation presents several inductive biases: objective functions that encourage time consistency between frames or that help model extreme events, a masked generative prior that improves the sequence modeling capabilities of the dynamics modules, disentangled representations that improve exploration strategies, physics-informed approaches that incorporate physical constraints, and attention-based workspaces that enhance multi-agent coordination.
Although the proposed methods yield performance gains in various experimental setups, the real value of this dissertation lies in the versatility of the proposed inductive biases. These biases, built and evaluated across different domains, are designed to be applicable to large-scale architectures, suggesting that the same algorithmic principles can be repurposed for vision-driven control tasks or for anticipating rare climate events with potentially large societal impacts. The findings span a variety of application domains, from simple simulated environments to real-world tasks, illustrating how advances in generative modeling and self-supervised learning can be systematically harnessed to tackle the complexity of dynamic scenes and interactive decision-making. Improved data efficiency also reduces the environmental footprint of large-scale training regimes.
In conclusion, this dissertation answers research questions that highlight the transformative potential of generative and latent modeling frameworks to reshape how machines perceive, learn about, and ultimately act within the environments they encounter. By bridging latent imagination, generative representations, and self-supervised objectives, this work reveals a path toward artificial systems that not only learn rapidly from experience but also exhibit interpretability and generalization capabilities, bringing us closer to intelligent agents capable of robust, forward-looking reasoning and collaboration.
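The abstract's remark that small inaccuracies compound over extended sequences can be illustrated with a toy autoregressive rollout. This is only a minimal sketch under stated assumptions: the scalar dynamics, the coefficients 1.02 and 1.03, and the names `true_step`, `model_step`, and `rollout_errors` are all hypothetical and are not taken from the dissertation.

```python
def true_step(x):
    # Hypothetical ground-truth dynamics (assumed for illustration only).
    return 1.02 * x


def model_step(x):
    # Hypothetical learned model whose coefficient is slightly off.
    return 1.03 * x


def rollout_errors(x0, horizon):
    """Roll both systems forward from x0 and record the absolute error per step.

    The model is autoregressive: each prediction is computed from the model's
    own previous prediction, never from the true state.
    """
    x_true, x_model = x0, x0
    errors = []
    for _ in range(horizon):
        x_true = true_step(x_true)
        x_model = model_step(x_model)
        errors.append(abs(x_model - x_true))
    return errors


errors = rollout_errors(x0=1.0, horizon=20)
for step in (1, 5, 10, 20):
    print(f"step {step:2d}: abs error = {errors[step - 1]:.4f}")
```

In this toy setting the one-step error is small, yet the gap widens at every step because each prediction inherits the error of the one before it. This is the effect that makes long-horizon video prediction and world-model rollouts difficult.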
Nowcasting leverages real-time atmospheric conditions to forecast weather over short periods. State-of-the-art models, including PySTEPS, struggle to forecast extreme weather events accurately because such events follow unpredictable distribution patterns. In this study, we design a physics-informed neural network for precipitation nowcasting using precipitation and meteorological data from the Royal Netherlands Meteorological Institute (KNMI). The model draws inspiration from the Physics-Informed Discriminator GAN (PID-GAN) formulation, which integrates physics-based supervision directly into the adversarial learning framework. The proposed model adopts a GAN structure in which a Vector Quantization Generative Adversarial Network (VQ-GAN) and a Transformer together form the generator, paired with a temporal discriminator. Our findings demonstrate that the PID-GAN model outperforms numerical and state-of-the-art deep generative models on downstream precipitation nowcasting metrics.
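For readers unfamiliar with the vector quantization step that lets a Transformer operate on discrete tokens, the following is a generic nearest-codebook lookup. It is a sketch of the general technique, not the authors' implementation: the two-dimensional codebook, the latent vectors, and the function name `quantize` are hypothetical, and a real VQ-GAN learns its codebook jointly with an encoder and decoder.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector`.

    Nearness is measured by squared Euclidean distance.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook and encoder outputs, chosen only for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

# Each continuous latent becomes a discrete token (its codebook index).
tokens = [quantize(z, codebook)[0] for z in latents]
print(tokens)
```

The resulting token sequence is what a sequence model would be trained to predict. Decoding maps each predicted index back to its codebook vector before image reconstruction.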