Deep Exploration by Planning With Uncertainty in Deep Model Based Reinforcement Learning

Abstract

Deep model-based reinforcement learning has achieved state-of-the-art, human-exceeding performance in many challenging domains. However, low sample efficiency and limited exploration remain leading obstacles in the field. In this work, we incorporate epistemic uncertainty into planning for better exploration. We develop a low-cost framework for estimating this uncertainty and computing how it propagates through planning with a learned model. We propose a new method, planning for exploration, that uses the propagated uncertainty to infer, in real time, the best action for exploration, achieving exploration that is informed, sequential over multiple time steps, and responsive to uncertainty in decisions that lie multiple steps into the future (deep exploration). To evaluate our method with the state-of-the-art algorithm MuZero, we incorporate different uncertainty estimation mechanisms, modify the Monte-Carlo tree search planning used by MuZero to incorporate our framework, and overcome the challenges of learning from off-policy, exploratory trajectories with an algorithm that learns from on-policy targets. Our results demonstrate that planning for exploration achieves effective deep exploration even when deployed with an algorithm that learns from on-policy targets and with standard, scalable uncertainty estimation mechanisms. We further provide an ablation study illustrating that the methodology we propose for generating on-policy targets from exploratory trajectories alleviates the adverse effects of training with trajectories that were not sampled from an exploitatory policy. We provide full access to our implementation and our algorithmic contributions through GitHub.
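
To give a rough sense of the kind of mechanism the abstract describes, the sketch below shows a pUCT-style child selection and backup in which each search node tracks an epistemic-uncertainty estimate alongside its value, and the propagated uncertainty is added as an exploration bonus during selection. This is a minimal illustration under our own assumptions, not the paper's implementation; the names and constants (Node, select_child, backup, c_puct, beta) are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Search-tree node that carries an epistemic-uncertainty statistic next to its value."""
    prior: float
    visit_count: int = 0
    value_sum: float = 0.0
    uncertainty_sum: float = 0.0  # uncertainty backed up through this node
    children: dict = field(default_factory=dict)

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

    def uncertainty(self) -> float:
        return self.uncertainty_sum / self.visit_count if self.visit_count else 0.0


def select_child(node: Node, c_puct: float = 1.25, beta: float = 1.0):
    """Pick the child maximizing value + visit-count bonus + uncertainty bonus."""
    def score(child: Node) -> float:
        u = c_puct * child.prior * math.sqrt(node.visit_count) / (1 + child.visit_count)
        # Exploratory term: favor actions whose subtrees carry high epistemic uncertainty.
        return child.value() + u + beta * child.uncertainty()

    return max(node.children.items(), key=lambda kv: score(kv[1]))


def backup(path, leaf_value: float, leaf_uncertainty: float, discount: float = 0.997):
    """Propagate value and discounted uncertainty from the evaluated leaf back to the root."""
    value, unc = leaf_value, leaf_uncertainty
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        node.uncertainty_sum += unc
        value *= discount
        unc *= discount ** 2  # variance-like quantities scale with the squared discount
```

In this sketch, beta trades off exploitation of estimated value against exploration of uncertain subtrees; setting beta to zero recovers standard pUCT selection, while larger values bias the search, and hence the chosen root action, toward regions the model is least certain about.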