Improving (ε,δ)-PAC Monte Carlo tree search using spherical confidence regions

Master Thesis (2020)
Author(s)

Thomas Schiet (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

K.I. Aardal – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

W.M. Koolen – Mentor (Centrum Wiskunde & Informatica (CWI))

F.A. Oliehoek – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2020
Language
English
Graduation Date
14-12-2020
Awarding Institution
Delft University of Technology
Programme
Applied Mathematics
Faculty
Electrical Engineering, Mathematics and Computer Science
Downloads counter
269
Collections
thesis
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We consider a simplified version of the Monte Carlo tree search (MCTS) problem, a problem where, given a game tree with stochastic reward, one is tasked with finding the best move from the root. This problem is well studied, and recently impressive results have been obtained. For example, in 2016, when the AlphaGo program beat the professional Go player Lee Sedol. Even though these results were spectacular, not much is known about the guaranteed accuracy of these algorithms. In this work, we review the FindTopWinner algorithm for the MCTS problem. The FindTopWinner algorithm is guaranteed to give an accurate result with high probability, as is formalized in the (ε,δ)-PAC learning framework. However, its sample complexity is not optimal. We improve FindTopWinner by introducing the Spherical Confidence Region- MCTS (SCR-MCTS) algorithm, which operates by creating a spherical confidence region on the joint values of the leaves in multiple iterations called epochs. We show that SCR-MCTS has the (ε,δ)-PAC guarantee, and we show that empirically SCR-MCTS draws fewer samples than FindTopWinner does on various benchmark trees.

Files

MSc_thesis.pdf
(pdf | 1.02 Mb)
License info not available