Deep Reinforcement Learning for Bipedal Robots

Master Thesis (2017)
Authors

D. Rastogi (TU Delft - Mechanical Engineering)

Faculty
Mechanical Engineering
Publication Year
2017
Language
English
Copyright
© 2017 Divyam Rastogi
Graduation Date
2017-08
Awarding Institution
Delft University of Technology
Faculty
Mechanical Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Reinforcement Learning (RL) is a general-purpose framework for designing controllers for non-linear systems. It learns a controller (policy) by trial and error, which makes it highly suitable for systems that are difficult to control with conventional control methodologies, such as walking robots. Traditionally, RL has only been applicable to problems with low-dimensional state spaces, but the use of deep neural networks as function approximators within RL has shown impressive results for the control of high-dimensional systems. This approach is known as Deep Reinforcement Learning (DRL).

A major drawback of DRL algorithms is that they generally require a large number of samples and long training times, which becomes a challenge when working with real robots. Therefore, most applications of DRL methods have been limited to simulation platforms. Moreover, due to model uncertainties such as friction and inaccuracies in masses, lengths, etc., a policy trained on a simulation model might not transfer directly to a real robot.

The objective of the thesis is to apply a DRL algorithm, the Deep Deterministic Policy Gradient (DDPG), to a 2D bipedal robot. The bipedal robot used for the analysis, known as LEO, was developed by the Delft BioRobotics Lab for reinforcement learning research. The DDPG method is applied to a simulated model of LEO and compared with traditional RL methods such as SARSA. To overcome the high sample requirement of learning a policy on the real system, this thesis develops an iterative approach that learns a difference model and then learns a new policy using this difference model. The difference model captures the mismatch between the real robot and the simulated model.
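
To make the difference-model idea concrete, the sketch below shows one plausible way to augment a simulator with a learned correction term. It is a minimal illustration under stated assumptions, not the thesis implementation: the names (DifferenceModel, sim_step, corrected_step) are hypothetical, and a simple least-squares fit stands in for whatever function approximator is actually used.

```python
import numpy as np

class DifferenceModel:
    """Illustrative difference model: predicts the gap between the real
    robot's next state and the simulator's prediction for the same
    (state, action) pair. A linear least-squares fit is used here purely
    for brevity."""

    def __init__(self, state_dim, action_dim):
        self.W = np.zeros((state_dim, state_dim + action_dim))

    def fit(self, states, actions, errors):
        # errors[i] = real_next_state[i] - sim_next_state[i],
        # collected from a small batch of rollouts on the real system.
        X = np.hstack([states, actions])
        self.W = np.linalg.lstsq(X, errors, rcond=None)[0].T

    def predict(self, state, action):
        return self.W @ np.concatenate([state, action])


def corrected_step(sim_step, diff_model, state, action):
    """Augmented simulator: nominal simulation plus learned correction.
    A new policy would then be trained against this corrected dynamics."""
    return sim_step(state, action) + diff_model.predict(state, action)
```

In the iterative scheme described above, one would alternate between collecting a few transitions on the real robot, refitting the difference model on the observed state-prediction errors, and retraining the policy on the corrected simulator.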

The approach is tested in simulation on two experimental setups: an inverted pendulum problem and LEO. Using the difference model, the learned policy is nearly optimal compared to training on the real system from scratch, while requiring only 10% of the samples.
