Controlling the estimation bias in deep reinforcement learning problems with sparse rewards

Towards robust robotic object manipulation learning

Master Thesis (2023)
Author(s)

R. Varga (TU Delft - Mechanical Engineering)

Contributor(s)

Dimitris Boskos – Mentor (TU Delft - Team Dimitris Boskos)

M. Plooij – Mentor (DEMCON advanced mechatronics Delft B.V.)

Jens Kober – Graduation committee member (TU Delft - Learning & Autonomous Control)

Manon Kok – Graduation committee member (TU Delft - Team Manon Kok)

Copyright
© 2023 Roland Varga
Publication Year
2023
Language
English
Graduation Date
27-01-2023
Programme
Mechanical Engineering | Systems and Control
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Many recent robot learning problems, both real and simulated, have been addressed with deep reinforcement learning. The resulting policies can handle high-dimensional, continuous state and action spaces and can incorporate machine-generated or human demonstration data. Many of these methods, especially those in the actor-critic framework, depend on state-action value estimates. Deriving unbiased estimates for these values remains an open research question, largely because the connection between accurate value estimates and system performance is not yet well understood. This thesis makes three main research contributions. First, it analyzes the connection between value estimates and performance for the TD3 algorithm. Second, it derives theoretical bounds on the true value function for environments in which a reward is given only upon successful completion of a task (sparse/binary reward). Third, a deliberate underestimation objective is added to the TD3 algorithm, together with the theoretical bounds, to improve performance when using human demonstration data that covers only a specific part of the state and action space. All algorithms are tested and evaluated on simulated robot manipulation tasks in the robosuite environment, where the robot is first trained on demonstration data and then gathers further experience in simulation. Results show that deliberate underestimation combined with the value bounds enables the robot to learn from human demonstrations, which was not possible with standard TD3. Additionally, applying the value bounds alone speeds up learning when using machine-generated datasets.
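To illustrate the kind of value bound the abstract describes: in a sparse/binary-reward task where the agent receives reward 1 only on success and 0 otherwise, any discounted return with discount factor in (0, 1) must lie in [0, 1], so bootstrapped critic targets can be clipped to that interval. The sketch below combines this with TD3's twin-critic minimum. This is a minimal illustration assuming that setting; the function and variable names (clipped_td_target, q1_next, q2_next) are hypothetical and not taken from the thesis itself.

```python
# Illustrative sketch (not the thesis implementation): bounded TD targets
# for a sparse/binary-reward task. Assumption: reward is 1 only at success,
# 0 elsewhere, so the true Q-value always lies in [0, 1].

def clipped_td_target(reward, done, gamma, q1_next, q2_next,
                      v_min=0.0, v_max=1.0):
    """TD3-style target: take the twin-critic minimum, then clip the
    bootstrapped target to the theoretical value bounds."""
    q_next = min(q1_next, q2_next)            # clipped double Q-learning
    target = reward + (0.0 if done else gamma * q_next)
    return min(max(target, v_min), v_max)     # enforce 0 <= target <= 1

# Example: an overestimating critic pair proposes q_next = 1.4, so the raw
# target would be 0.99 * 1.4 = 1.386; the bound caps it at 1.0.
print(clipped_td_target(reward=0.0, done=False, gamma=0.99,
                        q1_next=1.4, q2_next=1.5))  # prints 1.0
```

In a full training loop this clipping would be applied to the critic regression target; combining it with an explicit underestimation penalty is the thesis's proposed modification for learning from narrow human demonstration data.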
