Controlling the estimation bias in deep reinforcement learning problems with sparse rewards

Towards robust robotic object manipulation learning

Master Thesis (2023)
Author(s)

R. Varga (TU Delft - Mechanical Engineering)

Contributor(s)

Dimitris Boskos – Mentor (TU Delft - Team Dimitris Boskos)

M. Plooij – Mentor (DEMCON advanced mechatronics Delft B.V.)

Jens Kober – Graduation committee member (TU Delft - Learning & Autonomous Control)

Manon Kok – Graduation committee member (TU Delft - Team Manon Kok)

Copyright
© 2023 Roland Varga
Publication Year
2023
Language
English
Graduation Date
27-01-2023
Programme
Mechanical Engineering | Systems and Control
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Many recent robot learning problems, both real and simulated, have been addressed with deep reinforcement learning. The resulting policies can handle high-dimensional, continuous state and action spaces and can incorporate machine-generated or human demonstration data. Many of these methods, especially those in the actor-critic framework, depend on state-action value estimates. Deriving unbiased estimates for these values remains an open research question, largely because the connection between accurate value estimates and system performance is not yet well understood. This thesis makes three main research contributions. First, it analyzes the connection between value estimates and performance for the TD3 algorithm. Second, it derives theoretical bounds on the true value function for environments in which a reward is given only upon successful completion of a task (sparse/binary reward). Third, a deliberate underestimation objective is added to the TD3 algorithm, together with the theoretical bounds, to improve performance when using human demonstration data that covers only a specific part of the state and action space. All algorithms are tested and evaluated on simulated robot manipulation tasks in the robosuite environment, where the robot is first trained on demonstration data and then gathers further experience in simulation. Results show that deliberate underestimation combined with the value bounds enables the robot to learn from human demonstrations, which was not possible with standard TD3. Additionally, applying the value bounds alone speeds up learning when using machine-generated datasets.
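To illustrate the kind of value bound the abstract describes: in a sparse/binary-reward task where the agent receives reward 1 only on success and 0 otherwise, any discounted return with discount factor in (0, 1) must lie in [0, 1], so bootstrapped critic targets can be clipped to that interval. The sketch below combines this with TD3's twin-critic minimum. This is a minimal illustration assuming that setting; the function and variable names (clipped_td_target, q1_next, q2_next) are hypothetical and not taken from the thesis itself.

```python
# Illustrative sketch (not the thesis implementation): bounded TD targets
# for a sparse/binary-reward task. Assumption: reward is 1 only at success,
# 0 elsewhere, so the true Q-value always lies in [0, 1].

def clipped_td_target(reward, done, gamma, q1_next, q2_next,
                      v_min=0.0, v_max=1.0):
    """TD3-style target: take the twin-critic minimum, then clip the
    bootstrapped target to the theoretical value bounds."""
    q_next = min(q1_next, q2_next)            # clipped double Q-learning
    target = reward + (0.0 if done else gamma * q_next)
    return min(max(target, v_min), v_max)     # enforce 0 <= target <= 1

# Example: an overestimating critic pair proposes q_next = 1.4, so the raw
# target would be 0.99 * 1.4 = 1.386; the bound caps it at 1.0.
print(clipped_td_target(reward=0.0, done=False, gamma=0.99,
                        q1_next=1.4, q2_next=1.5))  # prints 1.0
```

In a full training loop this clipping would be applied to the critic regression target; combining it with an explicit underestimation penalty is the thesis's proposed modification for learning from narrow human demonstration data.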
