Fine-tuning deep RL with gradient-free optimization

Journal Article (2020)
Author(s)

Tim De Bruin (TU Delft - Learning & Autonomous Control)

J. Kober (TU Delft - Learning & Autonomous Control)

Karl Tuyls (DeepMind)

Robert Babuska (TU Delft - Learning & Autonomous Control)

Research Group
Learning & Autonomous Control
Copyright
© 2020 T.D. de Bruin, J. Kober, Karl Tuyls, R. Babuska
DOI related publication
https://doi.org/10.1016/j.ifacol.2020.12.2240
Publication Year
2020
Language
English
Issue number
2
Volume number
53
Pages (from-to)
8049-8056
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization to learn relatively efficiently, but they are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolution strategies, can offer a more stable alternative, but tend to be much less sample efficient. In this work we propose a combination that exploits the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action-selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods that rely only on gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark, and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.
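
The sketch below illustrates the two-phase idea described in the abstract, not the authors' actual implementation: the feature layers of a policy network are treated as already trained by a gradient-based method and frozen, and only the final action-selection layer is fine-tuned with a simple (mu, lambda) evolution strategy standing in for whichever gradient-free optimizer the paper uses. The environment, network sizes, and optimizer hyperparameters are illustrative assumptions, and the rollout function is a placeholder for episode returns from a real benchmark.

```python
# Minimal sketch (assumptions throughout): gradient-based pretraining has already
# produced a state representation and an initial policy; only the final
# action-selection layer is then fine-tuned gradient-free.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, N_ACTIONS = 8, 32, 4          # assumed problem sizes

# --- Phase 1 (assumed done elsewhere): gradient-based pretraining --------------
# Stand-ins for the learned representation and the initial action layer.
W_feat = rng.normal(scale=0.1, size=(FEAT_DIM, OBS_DIM))        # frozen features
W_out_init = rng.normal(scale=0.1, size=(N_ACTIONS, FEAT_DIM))  # initial policy head

def features(obs):
    """Frozen state representation from the gradient-based phase."""
    return np.tanh(W_feat @ obs)

def act(W_out, obs):
    """Greedy action from the (evolving) final action-selection layer."""
    return int(np.argmax(W_out @ features(obs)))

def episode_return(W_out, n_steps=100):
    """Placeholder rollout; replace with returns from the real environment."""
    total = 0.0
    for _ in range(n_steps):
        obs = rng.normal(size=OBS_DIM)
        a = act(W_out, obs)
        total += float(a == int(np.argmax(obs[:N_ACTIONS])))  # toy reward
    return total

# --- Phase 2: gradient-free fine-tuning of only the final layer ----------------
def es_finetune(W_out, generations=50, population=16, elite=4, sigma=0.05):
    """Simple (mu, lambda) evolution strategy over the final-layer weights only."""
    mean = W_out.copy()
    for _ in range(generations):
        candidates = [mean + sigma * rng.normal(size=mean.shape)
                      for _ in range(population)]
        scores = [episode_return(c) for c in candidates]
        best = np.argsort(scores)[-elite:]        # keep the elite candidates
        mean = np.mean([candidates[i] for i in best], axis=0)
    return mean

W_out_final = es_finetune(W_out_init)
print("return before:", episode_return(W_out_init))
print("return after :", episode_return(W_out_final))
```

Because only the final-layer parameters are searched, the gradient-free phase operates in a low-dimensional space, which is what keeps the number of required trials small compared to applying an evolution strategy to the full network.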