Trust-Region Twisted Policy Improvement
Joery A. de Vries (TU Delft - Sequential Decision Making)
Jinke He (TU Delft - Sequential Decision Making)
Yaniv Oren (TU Delft - Sequential Decision Making)
Matthijs T.J. Spaan (TU Delft - Sequential Decision Making)
Abstract
Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL). However, scaling MCTS to parallel compute has proven challenging in practice, which has motivated alternative planners such as sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy-inference problem. Yet, persistent design choices in these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning. Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner, through constrained action sampling and explicit handling of terminal states, and by improving policy and value target estimation. This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.
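To make the abstract's ingredients concrete, the following is a minimal, illustrative sketch of an SMC planning step in Python. It is not the paper's TRT-SMC algorithm: the names `env_step`, `prior_policy`, and `value_fn` are hypothetical interfaces, the `epsilon` uniform-mixture floor is a simple stand-in for a trust-region constraint on action sampling, and the value-difference twist is one plausible choice of twisting function, all assumed for illustration.

```python
import numpy as np

def smc_policy_improvement(env_step, prior_policy, value_fn, root_state,
                           n_particles=16, horizon=8, epsilon=0.1, rng=None):
    """Sketch of one SMC planning call (horizon >= 1 assumed).
    Hypothetical interfaces: env_step(state, action) -> (next_state, reward, done);
    prior_policy(state) -> action-probability vector; value_fn(state) -> float."""
    rng = np.random.default_rng() if rng is None else rng
    states = [root_state] * n_particles
    log_w = np.zeros(n_particles)                      # particle log-weights
    first_actions = np.full(n_particles, -1)           # action each particle took at the root
    alive = np.ones(n_particles, dtype=bool)           # explicit terminal-state handling

    for t in range(horizon):
        probs = np.stack([prior_policy(s) for s in states])
        # Constrained action sampling: mix the prior with a uniform floor so
        # no action's probability collapses to zero (trust-region stand-in).
        probs = (1 - epsilon) * probs + epsilon / probs.shape[1]
        actions = np.array([rng.choice(len(p), p=p) for p in probs])
        if t == 0:
            first_actions = actions
        for i in range(n_particles):
            if not alive[i]:
                continue                               # terminal particles are frozen
            s2, r, done = env_step(states[i], actions[i])
            # "Twist" the weight with a value-based lookahead increment.
            log_w[i] += r + (0.0 if done else value_fn(s2) - value_fn(states[i]))
            states[i], alive[i] = s2, not done
        # Resample particles in proportion to their weights, then reset weights.
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)
        states = [states[i] for i in idx]
        first_actions, alive = first_actions[idx], alive[idx]
        log_w = np.zeros(n_particles)

    # The improved root policy is the empirical distribution of the
    # surviving particles' first actions.
    n_actions = len(prior_policy(root_state))
    counts = np.bincount(first_actions, minlength=n_actions)
    return counts / counts.sum()
```

Under these assumptions, resampling concentrates particles on high-return trajectories, so the empirical distribution over root actions acts as the policy-improvement target that online planning is meant to produce.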