Low level quadcopter control using Reinforcement Learning

Developing a self-learning drone

Abstract

Reinforcement Learning (RL) is a learning paradigm in which an agent learns a task by trial and error. The agent explores its environment and, from the rewards it receives along the way, learns what constitutes appropriate behaviour.
Although it has roots in machine learning, RL differs fundamentally from other machine learning methods: in contrast to them, an RL agent has to generate its own data to learn from. In this thesis, we aim to train an RL agent to fly a quadcopter so that it tracks any target position (way-point) in three-dimensional space. Whereas conventional control strategies for quadcopters involve separate attitude and position controllers, and most RL solutions focus on only one of the two, our goal is to design a low-level RL controller that computes motor commands directly from sensor input, thereby replacing both the attitude and position controllers with a single RL policy. The policies we develop are trained with the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, a variant of the Deep Deterministic Policy Gradient (DDPG) algorithm.
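
As an illustration of such a low-level policy (not the exact architecture used in this thesis), a TD3 actor can be a small feed-forward network that maps the observed state directly to four normalized motor commands; the state layout, layer sizes, and action scaling below are assumptions for the sketch.

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic TD3 actor: observed state -> four motor commands.

    The state dimension (e.g. position error, velocity, attitude, angular
    rate) and the hidden-layer sizes are illustrative assumptions, not the
    architecture reported in the thesis.
    """

    def __init__(self, state_dim: int = 18, action_dim: int = 4, max_action: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Scale the tanh output to the normalized motor-command range.
        return self.max_action * self.net(state)
```

In TD3, two critic networks and delayed, smoothed target-policy updates are used alongside an actor of this kind to reduce the value overestimation seen in DDPG.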
The policy for attitude control trained for 3500 episodes, about 6.1e5 time steps. The learned policy is able to stabilize the attitude of the quadcopter (in simulation) with a success rate of 94%. For position control, two policies are generated with two different types of dense reward. The resulting type 1 policy shows large fluctuations in the motor commands and therefore oscillating attitude and position values; in none of the evaluation trajectories is a steady-state value reached. The type 2 reward produces a working policy after a shorter training time of 1200 episodes, about 1.1e6 time steps. For all tested trajectories, this policy achieves steady state for almost every way-point. This thesis demonstrates that TD3 can be used for low-level quadcopter control, replacing both the inner and outer control loops. Using the dense reward function and applying negative reward to the position error only results in a stable policy that can track way-points throughout 3D space. Future work requires the two-step parameter estimation to be tested on a real-life quadcopter, as well as deploying the policy on a real-life quadcopter.
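
A minimal sketch of a dense position-tracking reward of the kind described above, penalizing only the distance to the current way-point; the exact weighting and any additional shaping terms in the thesis's type 2 reward are not specified here, so the form below is an assumption.

```python
import numpy as np


def dense_position_reward(position: np.ndarray, waypoint: np.ndarray) -> float:
    """Dense reward that penalizes only the position error.

    A sketch of the 'negative reward on position only' idea; the actual
    type 2 reward in the thesis may use different scaling or extra terms.
    """
    return -float(np.linalg.norm(position - waypoint))
```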