Action Sampling Strategies in Sampled MuZero for Continuous Control
A JAX-Based Implementation with Evaluation of Sampling Distributions and Progressive Widening
V. Kuboň (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. He – Mentor (TU Delft - Sequential Decision Making)
F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)
Michael Weinmann – Graduation committee member (TU Delft - Computer Graphics and Visualisation)
Abstract
This work investigates the impact of action sampling strategies on the performance of Sampled MuZero, a reinforcement learning algorithm designed for continuous control settings such as robotics. In contrast to discrete domains, continuous action spaces require sampling actions from a proposal distribution β during Monte Carlo Tree Search (MCTS), a step that remains underexplored despite being central to the algorithm's effectiveness. We systematically study how performance is influenced by (1) the choice of the proposal distribution β and (2) the use of progressive widening, an MCTS augmentation that samples additional actions at frequently visited search-tree nodes. Our JAX-based implementation of Sampled MuZero is evaluated on the Brax HalfCheetah environment, testing β as either a uniform distribution over the action space or the agent's policy distribution. Additionally, we examine how different progressive widening parameters affect planning depth and computational efficiency. Results show that while modulating the temperature of the policy-based proposal provides only marginal benefits under specific conditions, progressive widening with properly calibrated parameters can improve both planning depth and episode returns.
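
To make the two ingredients of the abstract concrete, the sketch below illustrates how a proposal distribution β (uniform over a box-bounded action space, or the agent's Gaussian policy) and a progressive widening rule might interact at a single search-tree node. It is a minimal illustration under assumptions not taken from the thesis: the common widening criterion |A(s)| ≤ k · N(s)^α, a diagonal-Gaussian policy, and the hypothetical parameter names pw_k and pw_alpha; it is not the thesis's exact configuration.

    # Illustrative sketch only; parameter names and the k * N^alpha criterion
    # are common conventions, not necessarily those used in the thesis.
    import jax
    import jax.numpy as jnp

    def sample_proposal(key, mode, policy_mean, policy_std, action_low, action_high):
        """Draw one action from the proposal distribution beta."""
        if mode == "uniform":
            # beta = uniform over the box-bounded action space
            return jax.random.uniform(key, policy_mean.shape,
                                      minval=action_low, maxval=action_high)
        # beta = the agent's (here: diagonal-Gaussian) policy distribution
        sample = policy_mean + policy_std * jax.random.normal(key, policy_mean.shape)
        return jnp.clip(sample, action_low, action_high)

    def maybe_widen(key, node_visits, num_actions, pw_k, pw_alpha, **proposal_kwargs):
        """Progressive widening: sample one more action for this node while
        |A(s)| <= pw_k * N(s)**pw_alpha, otherwise keep the current action set."""
        if num_actions <= pw_k * node_visits ** pw_alpha:
            return sample_proposal(key, **proposal_kwargs)  # expand the node
        return None  # search only among already-sampled actions

    key = jax.random.PRNGKey(0)
    new_action = maybe_widen(
        key, node_visits=50, num_actions=5, pw_k=1.0, pw_alpha=0.5,
        mode="policy",
        policy_mean=jnp.zeros(6), policy_std=0.3 * jnp.ones(6),  # 6-dim, HalfCheetah-sized
        action_low=-1.0, action_high=1.0,
    )
    print(new_action)  # a new 6-dim action, or None if the node is not widened

Larger pw_k or pw_alpha values add actions to a node more aggressively (broader but shallower search), while smaller values keep the candidate set small and let visits concentrate, which is the depth/breadth trade-off the abstract refers to.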