The impact of model learning losses on the sample efficiency of MuZero in Atari

Bachelor Thesis (2025)
Author(s)

D.I. Popovici (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. He – Mentor (TU Delft - Sequential Decision Making)

F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)

Michael Weinmann – Graduation committee member (TU Delft - Computer Graphics and Visualisation)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Recent advances in reinforcement learning (RL) have achieved superhuman performance in various domains but often rely on vast numbers of environment interactions, limiting their practicality in real-world scenarios. MuZero is an RL algorithm that uses Monte Carlo Tree Search with a learned dynamics model, which is trained only to predict rewards, values, and policies, without any explicit objective to match real environment transitions. This work investigates how constraining MuZero's learned model to follow the real environment dynamics, with either a temporal-consistency loss over latent states or a pixel-level observation-reconstruction loss, affects its sample efficiency under the Atari100k benchmark. We evaluate performance on Pong, Breakout, and MsPacman, analyzing the impact of each loss and its sensitivity to the loss weight. Our results show that the temporal-consistency loss can improve performance in certain environments while the observation-reconstruction loss fails to do so, and that both losses are highly sensitive to their weight coefficient, indicating that they may require task-specific tuning.
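
As a rough illustration of the two auxiliary objectives named in the abstract, the following sketch shows one way such losses could be defined and added to a base MuZero loss through a weight coefficient. The abstract does not specify the exact formulation, so everything here is an assumption: the function names are hypothetical, the temporal-consistency term is taken to be a negative cosine similarity between predicted and target latent states, and the reconstruction term is taken to be a mean squared pixel error.

```python
import numpy as np


def temporal_consistency_loss(predicted_latent, target_latent, eps=1e-8):
    """Assumed form: negative cosine similarity between latent batches.

    predicted_latent: latent state produced by the learned dynamics model.
    target_latent: latent state encoded from the real next observation
                   (treated as a fixed target in this sketch).
    """
    p_norm = np.linalg.norm(predicted_latent, axis=-1, keepdims=True) + eps
    t_norm = np.linalg.norm(target_latent, axis=-1, keepdims=True) + eps
    cosine = np.sum((predicted_latent / p_norm) * (target_latent / t_norm), axis=-1)
    return float(-np.mean(cosine))


def reconstruction_loss(decoded_obs, real_obs):
    """Assumed form: mean squared error over all pixels."""
    return float(np.mean((decoded_obs - real_obs) ** 2))


def total_loss(base_loss, aux_loss, weight):
    """Base MuZero loss (reward, value, policy) plus a weighted auxiliary term."""
    return base_loss + weight * aux_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.normal(size=(4, 16))     # hypothetical batch of predicted latents
    target = rng.normal(size=(4, 16))   # hypothetical batch of target latents
    aux = temporal_consistency_loss(pred, target)
    # The weight is the coefficient whose sensitivity the thesis studies;
    # the value 2.0 here is arbitrary.
    print(total_loss(base_loss=1.0, aux_loss=aux, weight=2.0))
```

The single `weight` argument corresponds to the loss weight coefficient discussed in the abstract. Under the assumed forms above, the two auxiliary terms have different scales, which is one plausible reason a weight that suits one loss or game may not suit another.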

Files

Research_paper_final.pdf
(pdf | 1.26 Mb)
License info not available