Fine-tuning deep RL with gradient-free optimization

Journal article (2020)

Authors

T.D. de Bruin Learning & Autonomous Control - Mechanical, Maritime and Materials Engineering

J. Kober Learning & Autonomous Control - Mechanical, Maritime and Materials Engineering

Karl Tuyls DeepMind

R. Babuska Learning & Autonomous Control - Mechanical, Maritime and Materials Engineering

Research Group

Learning & Autonomous Control (Mechanical, Maritime and Materials Engineering) (TU Delft)

DOI: https://doi.org/10.1016/j.ifacol.2020.12.2240

Keywords: Optimization, Deep learning, Control, Neural networks, Reinforcement learning

To reference this document use:

http://resolver.tudelft.nl/uuid:b85a8866-3a36-49b0-a54f-cc0c700e842a

More Info

Published Date

2020

Language

English

Faculty

Mechanical, Maritime and Materials Engineering

Department

Cognitive Robotics

Abstract

Deep reinforcement learning makes it possible to train control policies that map high-dimensional observations to actions. These methods typically use gradient-based optimization techniques to enable relatively efficient learning, but are notoriously sensitive to hyperparameter choices and do not have good convergence properties. Gradient-free optimization methods, such as evolutionary strategies, can offer a more stable alternative but tend to be much less sample efficient. In this work we propose a combination, using the relative strengths of both. We start with a gradient-based initial training phase, which is used to quickly learn both a state representation and an initial policy. This phase is followed by a gradient-free optimization of only the final action selection parameters. This enables the policy to improve in a stable manner to a performance level not obtained by gradient-based optimization alone, using many fewer trials than methods using only gradient-free optimization. We demonstrate the effectiveness of the method on two Atari games, a continuous control benchmark and the CarRacing-v0 benchmark. On the latter we surpass the best previously reported score while using significantly fewer episodes.
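
The two-phase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the gradient-free optimizer, so a simple elitist evolution strategy is swapped in here, and a toy regression objective stands in for episode returns. All names (`frozen_features`, `episode_return`, `fine_tune_last_layer`) and all parameter values are hypothetical. The feature weights are assumed to come from an earlier gradient-based phase and stay fixed; only the final action-selection weights are perturbed.

```python
import numpy as np


def frozen_features(obs, W_feat):
    """Stand-in for the state representation learned in the gradient-based phase (kept fixed)."""
    return np.tanh(obs @ W_feat)


def episode_return(theta, W_feat, observations, target_actions):
    """Toy stand-in for an episode return: negative mean squared action error."""
    actions = frozen_features(observations, W_feat) @ theta
    return -np.mean((actions - target_actions) ** 2)


def fine_tune_last_layer(theta0, fitness, iterations=200, population=16, sigma=0.1, seed=0):
    """Elitist evolution strategy over the final-layer weights only (assumed optimizer)."""
    rng = np.random.default_rng(seed)
    best_theta = theta0.copy()
    best_fit = fitness(best_theta)
    for _ in range(iterations):
        # Sample a population of Gaussian perturbations around the current best weights.
        noise = rng.standard_normal((population,) + best_theta.shape)
        candidates = best_theta + sigma * noise
        scores = np.array([fitness(c) for c in candidates])
        i = int(np.argmax(scores))
        # Accept a candidate only if it improves on the current best.
        if scores[i] > best_fit:
            best_theta, best_fit = candidates[i], scores[i]
    return best_theta, best_fit


# Hypothetical setup: random fixed features and an imperfect initial policy.
rng = np.random.default_rng(1)
W_feat = rng.standard_normal((8, 4))
observations = rng.standard_normal((64, 8))
true_theta = rng.standard_normal((4, 1))
target_actions = frozen_features(observations, W_feat) @ true_theta
theta0 = true_theta + 0.5 * rng.standard_normal((4, 1))


def fitness(theta):
    return episode_return(theta, W_feat, observations, target_actions)


initial_fit = fitness(theta0)
theta, final_fit = fine_tune_last_layer(theta0, fitness)
```

Because the search space is restricted to the last layer, the number of parameters stays small, which is what makes a population-based search affordable in this sketch. With a deterministic objective the elitist update cannot make the policy worse; real episode returns are noisy, so a practical version would need to average several rollouts per candidate.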

Files

1_s2.0_S2405896320329001_main.... (pdf)

(pdf | 0.672 MB)