Generalization and locality in the AlphaZero algorithm

A study in single-agent, fully observable, deterministic environments


Abstract

Recently, the AlphaGo algorithm managed to defeat top-level human players in the game of Go. Achieving professional-level performance in Go has long been considered an AI milestone. The challenging properties of the game, namely its high state-space complexity, long reward horizon and high action branching factor, are shared by many other complex planning problems, such as robotics applications. This makes the algorithmic solutions behind AlphaGo particularly interesting for further research. One of the key innovations of the algorithm is the combination of Monte Carlo tree search (MCTS) with deep learning. The main hypothesis of the thesis is that the success of the algorithm can be attributed to the combination of the generalization capacity of deep neural networks and the local information provided by tree search. This hypothesis is evaluated by applying the AlphaZero algorithm (an extension of AlphaGo) to single-player, deterministic and fully observable reinforcement learning environments. The thesis answers two research questions. First, what changes need to be made to transfer the AlphaZero algorithm to these environments? The different reward support of these environments can cause learning to fail, since assumptions in the MCTS algorithm are violated. The thesis offers solutions to this problem, including adaptive return normalization. The second research question examines the relative importance of locality and generalization in the performance of the AlphaZero algorithm. It is answered by comparing the performance of search trees of varying sizes in several RL environments under fixed time budgets. Building small trees supports generalization by allowing more frequent training of the neural network, whereas building larger trees provides more accurate value estimates. This creates a trade-off between improved generalization capacity and more accurate local information under a fixed time budget. The experimental results show that mid-sized trees achieve the best performance, which suggests that balancing local information and generalization is key to the success of the algorithm. Based on these results, possible extensions to the algorithm are proposed. Finally, the thesis highlights the relevance of the two-component system from a broader perspective and discusses the possible future impact of the algorithm.
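
To make the adaptive return normalization idea concrete, the sketch below shows one plausible way to implement it, assuming a running min/max over the returns observed during search (in the style of MuZero's value rescaling) rather than the thesis's exact method. The names ReturnNormalizer, normalized Q-values and uct_score are illustrative assumptions, not identifiers from the thesis.

```python
import math


class ReturnNormalizer:
    """Tracks the running min/max of returns seen during search so Q-values
    can be rescaled to [0, 1] regardless of the environment's reward support."""

    def __init__(self):
        self.min_return = float("inf")
        self.max_return = float("-inf")

    def update(self, value):
        # Called on every return backed up through the tree.
        self.min_return = min(self.min_return, value)
        self.max_return = max(self.max_return, value)

    def normalize(self, value):
        # Until two distinct returns have been observed, pass the value through.
        if self.max_return > self.min_return:
            return (value - self.min_return) / (self.max_return - self.min_return)
        return value


def uct_score(parent_visits, child_visits, child_value, prior, normalizer, c_puct=1.5):
    """PUCT-style selection score with the exploitation term rescaled to [0, 1],
    so the exploration constant keeps a comparable effect across environments."""
    q = normalizer.normalize(child_value) if child_visits > 0 else 0.0
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u
```

In such a scheme, every return propagated during the backup phase would also be fed to the normalizer via update, so the scaling adapts as larger or smaller returns are discovered and the selection rule no longer depends on the reward range assumed in the original two-player setting.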