PEBL: Pessimistic Ensembles for Offline Deep Reinforcement Learning

Conference Paper (2021)
Author(s)

Jordi Smit (Student TU Delft)

C.T. Ponnambalam (TU Delft - Algorithmics)

Matthijs T. J. Spaan (TU Delft - Algorithmics)

Frans Oliehoek (TU Delft - Interactive Intelligence)

Research Group
Algorithmics
Copyright
© 2021 Jordi Smit, C.T. Ponnambalam, M.T.J. Spaan, F.A. Oliehoek
Publication Year
2021
Language
English
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Offline reinforcement learning (RL), or learning from a fixed data set, is an attractive alternative to online RL. Offline RL promises to address the cost and safety implications of taking numerous random or bad actions online, a crucial aspect of traditional RL that makes it difficult to apply in real-world problems. However, when RL is naïvely applied to a fixed data set, the resulting policy may exhibit poor performance in the real environment. This happens due to over-estimation of the value of state-action pairs not sufficiently covered by the data set. A promising way to avoid this is by applying pessimism and acting according to a lower bound estimate on the value. It has been shown that penalizing the learned value according to a pessimistic bound on the uncertainty can drastically improve offline RL. In deep reinforcement learning, however, uncertainty estimation is highly non-trivial and the development of effective uncertainty-based pessimistic algorithms remains an open question. This paper introduces two novel offline deep RL methods built on Double Deep Q-Learning and Soft Actor-Critic. We show how a multi-headed bootstrap approach to uncertainty estimation is used to calculate an effective pessimistic value penalty. Our approach is applied to benchmark offline deep RL domains, where we demonstrate that our methods can often beat the current state-of-the-art.
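
To make the idea of an uncertainty-based pessimistic penalty concrete, the following is a minimal Python (PyTorch) sketch of a multi-headed bootstrap Q-network whose head disagreement serves as an uncertainty estimate, with a lower-bound value computed as the ensemble mean minus a multiple of the ensemble standard deviation. This is an illustration under stated assumptions, not the authors' implementation: the class and function names, the number of heads, and the penalty coefficient beta are all hypothetical.

# Illustrative sketch (not the paper's code): a multi-headed Q-network whose
# head disagreement is used as an uncertainty estimate, and a pessimistic
# (lower-bound) value computed as mean minus beta * standard deviation.
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, num_heads=10, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # Each bootstrap head produces its own Q-value estimate per action.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, action_dim) for _ in range(num_heads)]
        )

    def forward(self, state):
        z = self.trunk(state)
        # Stack per-head outputs: shape (num_heads, batch, action_dim).
        return torch.stack([head(z) for head in self.heads], dim=0)

def pessimistic_q(q_net, state, beta=1.0):
    """Lower-bound Q estimate: ensemble mean minus beta times ensemble std."""
    q_all = q_net(state)          # (num_heads, batch, action_dim)
    q_mean = q_all.mean(dim=0)
    q_std = q_all.std(dim=0)      # head disagreement as an uncertainty proxy
    return q_mean - beta * q_std  # pessimistic value penalty

# Example usage with random inputs.
net = MultiHeadQNetwork(state_dim=8, action_dim=4)
states = torch.randn(32, 8)
q_lb = pessimistic_q(net, states)  # (32, 4) pessimistic Q-values

In an offline setting, such a lower-bound estimate would typically replace the ordinary Q-value in the bootstrap target or in action selection, so that actions poorly covered by the data set, where the heads disagree, are penalized.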

Files

R2AW_paper_6_1.pdf
(pdf | 0.425 MB)
License info not available