Scalable Safe Policy Improvement via Monte Carlo Tree Search

Journal Article (2023)
Author(s)

Alberto Castellini (University of Verona)

Federico Bianchi (University of Verona)

Edoardo Zorzi (University of Verona, ETH Zürich)

Thiago D. Simão (Radboud Universiteit Nijmegen)

Alessandro Farinelli (University of Verona)

M.T.J. Spaan (TU Delft - Algorithmics)

Research Group
Algorithmics
Copyright
© 2023 Alberto Castellini, Federico Bianchi, Edoardo Zorzi, Thiago D. Simão, Alessandro Farinelli, M.T.J. Spaan
Publication Year
2023
Language
English
Volume number
202
Pages (from-to)
3732-3756
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Algorithms for safely improving policies are important for deploying reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes a safe policy improvement online using a strategy based on Monte Carlo Tree Search. We theoretically prove that, as the number of simulations grows, the policy generated by MCTS-SPIBB converges to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis, performed on three standard benchmark domains, shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
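
The core constraint behind SPIBB-style methods can be illustrated with a short sketch. Actions whose state-action counts in the dataset fall below a threshold (often written N∧ in the SPIBB literature) are "bootstrapped": the improved policy must keep the baseline policy's probability mass on them, and only well-observed actions may be optimized. In MCTS-SPIBB this constraint is applied at the nodes reached during tree search, with action values estimated from simulation returns. The Python sketch below is illustrative only, not the authors' implementation; all names (spibb_policy_at_state, n_min, the toy numbers) are assumptions, and the Q-values are passed in directly rather than estimated by simulations.

import numpy as np

def spibb_policy_at_state(pi_b, counts, q_values, n_min):
    """Greedy SPIBB-style projection for a single state (illustrative sketch).

    pi_b:     baseline action probabilities, shape (n_actions,)
    counts:   dataset visit counts N(s, a) for this state
    q_values: action-value estimates (in MCTS-SPIBB these would come
              from simulation returns; here they are given directly)
    n_min:    bootstrapping threshold; actions observed fewer than
              n_min times keep their baseline probability
    """
    bootstrapped = counts < n_min
    # Rarely observed actions keep the baseline's probability mass.
    pi = np.where(bootstrapped, pi_b, 0.0)
    # The remaining mass may be reallocated to the best well-observed action.
    free_mass = 1.0 - pi.sum()
    if free_mass > 0.0:
        trusted = np.flatnonzero(~bootstrapped)
        best = trusted[np.argmax(q_values[trusted])]
        pi[best] += free_mass
    return pi

# Toy usage: uniform baseline over 4 actions; only actions 1 and 3 are well observed.
pi_b = np.full(4, 0.25)
counts = np.array([2, 50, 1, 40])
q_values = np.array([0.9, 0.1, 0.8, 0.6])
print(spibb_policy_at_state(pi_b, counts, q_values, n_min=10))
# [0.25 0.   0.25 0.5 ]

In this toy case the baseline mass (0.25 each) stays on the two rarely observed actions, and the remaining 0.5 moves to the best well-observed action. Because this projection only needs counts and value estimates for the current state, it can be computed on demand for the states the agent actually visits, which is what the abstract means by computing the policy online and locally.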

Files

Castellini23a.pdf
(pdf | 3.06 MB)
License info not available