Fine-tuning deep RL with gradient-free optimization

Journal Article (2020)
Author(s)

Tim De Bruin (TU Delft - Learning & Autonomous Control)

J. Kober (TU Delft - Learning & Autonomous Control)

Karl Tuyls (DeepMind)

Robert Babuska (TU Delft - Learning & Autonomous Control)

Research Group
Learning & Autonomous Control
Copyright
© 2020 T.D. de Bruin, J. Kober, Karl Tuyls, R. Babuska
DOI related publication
https://doi.org/10.1016/j.ifacol.2020.12.2240
Publication Year
2020
Language
English
Issue number
2
Volume number
53
Pages (from-to)
8049-8056
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization to learn relatively efficiently, but they are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolution strategies, can offer a more stable alternative, but tend to be much less sample efficient. In this work we propose a combination that exploits the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action-selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods that rely only on gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark, and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.
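
The sketch below illustrates the two-phase idea described in the abstract, not the authors' actual implementation: the feature layers of a policy network are treated as already trained by a gradient-based method and frozen, and only the final action-selection layer is fine-tuned with a simple (mu, lambda) evolution strategy standing in for whichever gradient-free optimizer the paper uses. The environment, network sizes, and optimizer hyperparameters are illustrative assumptions, and the rollout function is a placeholder for episode returns from a real benchmark.

```python
# Minimal sketch (assumptions throughout): gradient-based pretraining has already
# produced a state representation and an initial policy; only the final
# action-selection layer is then fine-tuned gradient-free.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, N_ACTIONS = 8, 32, 4          # assumed problem sizes

# --- Phase 1 (assumed done elsewhere): gradient-based pretraining --------------
# Stand-ins for the learned representation and the initial action layer.
W_feat = rng.normal(scale=0.1, size=(FEAT_DIM, OBS_DIM))        # frozen features
W_out_init = rng.normal(scale=0.1, size=(N_ACTIONS, FEAT_DIM))  # initial policy head

def features(obs):
    """Frozen state representation from the gradient-based phase."""
    return np.tanh(W_feat @ obs)

def act(W_out, obs):
    """Greedy action from the (evolving) final action-selection layer."""
    return int(np.argmax(W_out @ features(obs)))

def episode_return(W_out, n_steps=100):
    """Placeholder rollout; replace with returns from the real environment."""
    total = 0.0
    for _ in range(n_steps):
        obs = rng.normal(size=OBS_DIM)
        a = act(W_out, obs)
        total += float(a == int(np.argmax(obs[:N_ACTIONS])))  # toy reward
    return total

# --- Phase 2: gradient-free fine-tuning of only the final layer ----------------
def es_finetune(W_out, generations=50, population=16, elite=4, sigma=0.05):
    """Simple (mu, lambda) evolution strategy over the final-layer weights only."""
    mean = W_out.copy()
    for _ in range(generations):
        candidates = [mean + sigma * rng.normal(size=mean.shape)
                      for _ in range(population)]
        scores = [episode_return(c) for c in candidates]
        best = np.argsort(scores)[-elite:]        # keep the elite candidates
        mean = np.mean([candidates[i] for i in best], axis=0)
    return mean

W_out_final = es_finetune(W_out_init)
print("return before:", episode_return(W_out_init))
print("return after :", episode_return(W_out_final))
```

Because only the final-layer parameters are searched, the gradient-free phase operates in a low-dimensional space, which is what keeps the number of required trials small compared to applying an evolution strategy to the full network.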