EfficientTDMPC
Improved MPC Objectives for Sample-Efficient Continuous Control
T. Evers (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.H.G. Dauwels – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Abstract
Model-based reinforcement learning can improve sample efficiency by learning a model of the environment and using it for planning. In continuous-control tasks, this allows an agent to evaluate many candidate action sequences before acting. However, planning with a learned model also introduces a failure mode: the planner optimizes a learned return estimate rather than the true environment return, and can therefore exploit errors in the dynamics model, reward model, or value function.
This thesis studies how the model predictive control objective used in TD-MPC-style agents can be made more reliable. The main contribution is EfficientTDMPC, a method that modifies the planner objective by aggregating return estimates across multiple dynamics heads and rollout depths, and by applying disagreement-based pessimism during reanalyze. These changes aim to reduce the variance and exploitability of model-based return estimates while preserving the sample efficiency benefits of latent model-based planning.
EfficientTDMPC sets a new state of the art on HumanoidBench-Hard and on the hard DeepMind Control Suite (DMC) tasks, while matching the state of the art on the easy DMC tasks. The thesis also discusses adaptive horizon selection as a future direction, arguing that planning depth should be treated as part of the uncertainty-aware planning objective. Overall, the thesis shows that the reliability of the learned planning objective is a central design problem in sample-efficient model-based reinforcement learning.
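To make the idea of aggregating return estimates and penalizing disagreement concrete, the following is a minimal illustrative sketch, not the thesis's actual objective. It assumes each candidate action sequence has a return estimate for every dynamics head and rollout depth, averages over depths within each head, and then subtracts a multiple of the standard deviation across heads. The function names, the nested-list layout, the weight `beta`, and the use of a standard deviation as the disagreement measure are all assumptions made for illustration. The sketch also folds the pessimism into candidate scoring for simplicity, whereas the abstract states that pessimism is applied during reanalyze.

```python
from statistics import fmean, pstdev


def pessimistic_scores(returns, beta=1.0):
    """Score candidates from per-head, per-depth return estimates.

    returns[h][d][c] is the estimated return of candidate c under
    dynamics head h at rollout depth d (hypothetical layout).
    beta is an assumed pessimism weight; beta=0 gives the plain mean.
    """
    num_candidates = len(returns[0][0])
    scores = []
    for c in range(num_candidates):
        # Average over rollout depths within each head.
        per_head = [fmean([depth[c] for depth in head]) for head in returns]
        # Mean across heads, minus a penalty for head disagreement.
        scores.append(fmean(per_head) - beta * pstdev(per_head))
    return scores


def select_candidate(returns, beta=1.0):
    """Return the index of the highest-scoring candidate."""
    scores = pessimistic_scores(returns, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Two heads, two depths, two candidates.
# Candidate 0: both heads agree on a return of 1.0.
# Candidate 1: one head predicts 3.0, the other 0.0.
example = [
    [[1.0, 3.0], [1.0, 3.0]],  # head 0
    [[1.0, 0.0], [1.0, 0.0]],  # head 1
]
optimistic_choice = select_candidate(example, beta=0.0)
pessimistic_choice = select_candidate(example, beta=1.0)
```

In this toy example the plain mean prefers candidate 1, whose high score depends on a single head, while the penalized score prefers candidate 0, on which the heads agree. This is the kind of model exploitation the abstract describes.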