Trust-Region Twisted Policy Improvement

Journal Article (2025)
Author(s)

Joery A. de Vries (TU Delft - Sequential Decision Making)

Jinke He (TU Delft - Sequential Decision Making)

Yaniv Oren (TU Delft - Sequential Decision Making)

Matthijs T.J. Spaan (TU Delft - Sequential Decision Making)

Research Group
Sequential Decision Making
Publication Year
2025
Language
English
Volume number
267
Pages (from-to)
12901-12923
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem. Yet, persistent design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
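To make the particle-filter view of planning concrete, the sketch below implements a generic SMC planner on a toy chain MDP: particles roll out candidate action sequences, are weighted by accumulated reward, and are resampled toward promising trajectories; the root policy improvement is read off as the empirical distribution over first actions among surviving particles. The environment, weighting scheme, and resampling rule here are illustrative assumptions, not the paper's TRT-SMC algorithm — in particular, it omits twisting, trust-region constraints, and learned value functions.

```python
import math
import random

def step(state, action):
    """Toy chain MDP: actions move the state by +/-1; reward peaks at state 5."""
    next_state = state + action
    reward = -abs(next_state - 5)
    done = next_state == 5
    return next_state, reward, done

def smc_plan(root_state, num_particles=64, horizon=10, seed=0):
    rng = random.Random(seed)
    actions = (-1, 1)
    # Each particle tracks (current state, first action taken, terminal flag).
    particles = []
    log_weights = []
    for _ in range(num_particles):
        a0 = rng.choice(actions)
        s, r, done = step(root_state, a0)
        particles.append((s, a0, done))
        log_weights.append(float(r))
    for _ in range(horizon - 1):
        # Propagate non-terminal particles with uniformly sampled actions;
        # terminal particles are frozen (explicit terminal-state handling).
        for i, (s, a0, done) in enumerate(particles):
            if done:
                continue
            a = rng.choice(actions)
            s2, r, d = step(s, a)
            particles[i] = (s2, a0, d)
            log_weights[i] += r
        # Multinomial resampling on normalized weights (log-sum-exp trick).
        m = max(log_weights)
        w = [math.exp(lw - m) for lw in log_weights]
        total = sum(w)
        probs = [x / total for x in w]
        idx = rng.choices(range(num_particles), weights=probs, k=num_particles)
        particles = [particles[i] for i in idx]
        log_weights = [0.0] * num_particles  # weights reset after resampling
    # Root policy improvement: empirical distribution over first actions.
    counts = {a: 0 for a in actions}
    for _, a0, _ in particles:
        counts[a0] += 1
    return {a: c / num_particles for a, c in counts.items()}

policy = smc_plan(root_state=0)
```

Starting from state 0, particles whose first action is +1 move toward the rewarding state 5 and dominate after resampling, so the returned root policy concentrates on +1. The abstract's critique applies here: naive weighting and resampling target the smoothing distribution over whole trajectories, whereas online planning only needs an improved distribution over the first action.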

Files

De-vries25a.pdf
(pdf | 1.06 Mb)