Action Sampling Strategies in Sampled MuZero for Continuous Control
A JAX-Based Implementation with Evaluation of Sampling Distributions and Progressive Widening
V. Kuboň (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. He – Mentor (TU Delft - Sequential Decision Making)
F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)
Michael Weinmann – Graduation committee member (TU Delft - Computer Graphics and Visualisation)
Abstract
This work investigates the impact of action sampling strategies on the performance of Sampled MuZero, a reinforcement learning algorithm designed for continuous control settings such as robotics. In contrast to discrete domains, continuous action spaces require sampling actions from a proposal distribution β during Monte Carlo Tree Search (MCTS), a step that remains underexplored despite being central to the algorithm's effectiveness. We systematically study how performance is influenced by (1) the choice of the proposal distribution β and (2) the use of progressive widening, an MCTS augmentation that samples additional actions at frequently visited search-tree nodes. Our JAX-based implementation of Sampled MuZero is evaluated on the Brax HalfCheetah environment, testing β as either a uniform distribution over the action space or the agent's policy distribution. Additionally, we examine how different progressive widening parameters affect planning depth and computational efficiency. Results show that while modulating the temperature of the policy-based proposal provides only marginal benefits under specific conditions, progressive widening with properly calibrated parameters can improve both planning depth and episode returns.
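
To make the two ingredients of the abstract concrete, the sketch below illustrates how a proposal distribution β (uniform over a box-bounded action space, or the agent's Gaussian policy) and a progressive widening rule might interact at a single search-tree node. It is a minimal illustration under assumptions not taken from the thesis: the common widening criterion |A(s)| ≤ k · N(s)^α, a diagonal-Gaussian policy, and the hypothetical parameter names pw_k and pw_alpha; it is not the thesis's exact configuration.

    # Illustrative sketch only; parameter names and the k * N^alpha criterion
    # are common conventions, not necessarily those used in the thesis.
    import jax
    import jax.numpy as jnp

    def sample_proposal(key, mode, policy_mean, policy_std, action_low, action_high):
        """Draw one action from the proposal distribution beta."""
        if mode == "uniform":
            # beta = uniform over the box-bounded action space
            return jax.random.uniform(key, policy_mean.shape,
                                      minval=action_low, maxval=action_high)
        # beta = the agent's (here: diagonal-Gaussian) policy distribution
        sample = policy_mean + policy_std * jax.random.normal(key, policy_mean.shape)
        return jnp.clip(sample, action_low, action_high)

    def maybe_widen(key, node_visits, num_actions, pw_k, pw_alpha, **proposal_kwargs):
        """Progressive widening: sample one more action for this node while
        |A(s)| <= pw_k * N(s)**pw_alpha, otherwise keep the current action set."""
        if num_actions <= pw_k * node_visits ** pw_alpha:
            return sample_proposal(key, **proposal_kwargs)  # expand the node
        return None  # search only among already-sampled actions

    key = jax.random.PRNGKey(0)
    new_action = maybe_widen(
        key, node_visits=50, num_actions=5, pw_k=1.0, pw_alpha=0.5,
        mode="policy",
        policy_mean=jnp.zeros(6), policy_std=0.3 * jnp.ones(6),  # 6-dim, HalfCheetah-sized
        action_low=-1.0, action_high=1.0,
    )
    print(new_action)  # a new 6-dim action, or None if the node is not widened

Larger pw_k or pw_alpha values add actions to a node more aggressively (broader but shallower search), while smaller values keep the candidate set small and let visits concentrate, which is the depth/breadth trade-off the abstract refers to.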