Low level quadcopter control using Reinforcement Learning

Developing a self-learning drone


Reinforcement Learning (RL) is a learning paradigm in which an agent learns a task by trial and error. The agent explores its environment and, from the rewards it receives along the way, learns which behaviour is appropriate.
Even though it has roots in machine learning, RL is essentially different from other machine learning methods: an RL agent has to generate its own data to learn from. In this thesis, we aim to train an RL agent to fly a quadcopter so that it tracks any target position (way-point) in three-dimensional space. Conventional control strategies for quadcopters use separate attitude and position controllers, and most RL solutions focus on one of the two. Our goal is instead to design a low-level RL controller that computes motor commands directly from sensor input, thereby replacing both the attitude and the position controller with a single RL policy. The policies we develop are trained with the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, a variant of the Deep Deterministic Policy Gradient (DDPG) algorithm.
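As a rough illustration of how TD3 differs from DDPG, the sketch below computes the TD3 learning target: the target action is smoothed with clipped noise, and the smaller of two target critics is used to limit value overestimation. It is a minimal NumPy sketch and not the implementation used in the thesis; the function name `td3_target`, the callables `actor_target`, `q1_target` and `q2_target`, and all hyperparameter values are assumptions made for illustration. The "delayed" part of TD3 (updating the policy less often than the critics) is not shown.

```python
import numpy as np

def td3_target(reward, done, next_state, actor_target, q1_target, q2_target,
               rng, gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Return the TD3 critic target y = r + gamma * (1 - done) * min(Q1', Q2').

    All callables and default values are hypothetical stand-ins.
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    base_action = actor_target(next_state)
    noise = rng.normal(0.0, policy_noise, size=base_action.shape)
    noise = np.clip(noise, -noise_clip, noise_clip)
    next_action = np.clip(base_action + noise, -max_action, max_action)

    # Clipped double Q-learning: take the element-wise minimum of both critics.
    q_min = np.minimum(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))

    # No bootstrapping from terminal transitions (done == 1).
    return reward + gamma * (1.0 - done) * q_min
```

Both critics would then be regressed towards this target, while the actor is updated only every few critic updates.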
The policy for attitude control was trained for 3500 episodes, around 6.1e5 time steps. The learned policy is able to stabilize the attitude of the quadcopter (in simulation) with a success rate of 94 %. For position control, two policies are generated with two different types of dense reward. The resulting type 1 policy shows large fluctuations in its motor commands and therefore oscillating attitude and position values; it does not reach a steady-state value in any of the evaluation trajectories. The type 2 reward produces a working policy after a shorter training time of 1200 episodes, 1.1e6 time steps. For all tested trajectories, this policy reaches steady state for almost every way-point. This thesis shows that TD3 can be used for low-level quadcopter control, replacing both the inner and the outer loop of the quadcopter controller. Using the dense reward function and applying negative reward on position control only results in a stable policy that can track way-points throughout 3D space. Future work includes testing the two-step parameter estimation on a real-life quadcopter, as well as deploying the policy on a real-life quadcopter.
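To make the idea of a dense reward with a negative term on position only more concrete, the sketch below returns, at every time step, the negative Euclidean distance between the quadcopter and the way-point. The exact reward functions of the thesis are not given in this abstract, so the function name `dense_position_reward` and the choice of the Euclidean norm are assumptions made purely for illustration.

```python
import numpy as np

def dense_position_reward(position, waypoint):
    """Hypothetical dense reward: negative distance to the way-point.

    The reward is available at every time step (dense) and penalizes
    only the position error, not attitude or motor commands.
    """
    position = np.asarray(position, dtype=float)
    waypoint = np.asarray(waypoint, dtype=float)
    return -float(np.linalg.norm(position - waypoint))
```

With such a reward the agent is penalized less the closer it gets to the way-point, and the maximum reward of zero is reached exactly at the target position.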