Prioritized Experience Replay based on the Wasserstein Metric in Deep Reinforcement Learning

The regularizing effect of modelling return distributions

Master Thesis (2019)
Author(s)

T. Greevink (TU Delft - Mechanical Engineering)

Contributor(s)

T.D. De Bruin – Mentor (TU Delft - Learning & Autonomous Control)

Jens Kober – Graduation committee member (TU Delft - Learning & Autonomous Control)

J. Hellendoorn – Graduation committee member (TU Delft - Cognitive Robotics)

Faculty
Mechanical Engineering
Copyright
© 2019 Thijs Greevink
Publication Year
2019
Language
English
Graduation Date
12-04-2019
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Systems and Control
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis tests the hypothesis that distributional deep reinforcement learning (RL) algorithms achieve increased performance over expectation-based deep RL because of the regularizing effect of fitting a more complex model. This hypothesis was tested by comparing two variations of the distributional QR-DQN algorithm combined with prioritized experience replay. The first variation, called QR-W, prioritizes learning the return distributions; the second, QR-TD, prioritizes learning the Q-values. These algorithms were tested with a range of network architectures, from large architectures prone to overfitting to smaller ones prone to underfitting. To verify the findings, the experiment was repeated in two environments. As hypothesised, QR-W performed better with the networks prone to overfitting, and QR-TD performed better with the networks prone to underfitting. This suggests that fitting return distributions has a regularizing effect, which at least partially explains the performance of distributional algorithms. To compare QR-TD and QR-W to conventional benchmarks from the literature, they were tested in the Enduro environment from the Arcade Learning Environment proposed by Bellemare et al. QR-W outperformed the state-of-the-art algorithms IQN and Rainbow in a quarter of the training time.
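To make the difference between the two prioritization schemes concrete, the sketch below shows one plausible way to compute replay priorities for a single transition under a QR-DQN-style quantile representation: a Wasserstein-based priority (QR-W) versus a TD-error-based priority on the Q-value (QR-TD). The function names, the uniform-quantile assumption, and the exact priority definitions are illustrative assumptions for this abstract, not the thesis implementation.

```python
# Illustrative sketch (not the thesis code) of the two prioritization signals
# described in the abstract, assuming QR-DQN's representation of the return
# distribution as N equally weighted quantiles.
import numpy as np

def wasserstein_priority(pred_quantiles, target_quantiles):
    """QR-W style priority: 1-Wasserstein distance between predicted and
    target return distributions, each given by equally weighted quantiles."""
    # With uniformly weighted quantiles, the 1-Wasserstein distance reduces to
    # the mean absolute difference between the sorted quantile values.
    p = np.sort(pred_quantiles)
    t = np.sort(target_quantiles)
    return float(np.mean(np.abs(p - t)))

def td_priority(pred_quantiles, target_quantiles):
    """QR-TD style priority: absolute TD error on the Q-value, i.e. the
    difference between the means of the two quantile sets."""
    return float(abs(np.mean(pred_quantiles) - np.mean(target_quantiles)))

# Toy example: two 5-quantile return distributions for one transition.
pred = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
target = np.array([0.2, 0.4, 1.1, 1.4, 2.6])
print("QR-W priority :", wasserstein_priority(pred, target))
print("QR-TD priority:", td_priority(pred, target))
```

In this toy case the two distributions have nearly equal means, so the QR-TD priority is small while the QR-W priority remains sizeable, which illustrates how the two schemes can rank the same transition very differently.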

Files

Thesis.pdf
(pdf | 1.99 Mb)
License info not available