Deep Reinforcement Learning with Feedback-based Exploration


Abstract

Deep Reinforcement Learning enables us to control increasingly complex and high-dimensional problems. Modelling and control design are no longer required, which paves the way to numerous innovations, such as optimal control of ever more sophisticated robotic systems, fast and efficient scheduling and logistics, and effective personal drug dosing schemes that minimise complications, as well as applications not yet conceived. Yet this potential is obstructed by the need for vast amounts of data: without it, deep Reinforcement Learning (RL) cannot work. If we want to advance RL research and its applications, a primary concern is to improve this sample efficiency. Otherwise, all potential is restricted to settings where interaction is abundant, whilst this is seldom the case in real-world scenarios.

In this thesis we study binary corrective feedback as a general and intuitive manner to incorporate human intuition and domain knowledge in model-free machine learning. In accordance with our conclusions drawn from the literature, we present two algorithms, namely Probabilistic Merging of Policies (PMP) and its extension Predictive PMP (PPMP). Both methods estimate the abilities of their inbuilt RL entity by computing the covariance over multiple output heads of the actor network. Subsequently, the corrections are quantified by comparing the uncertainty in what is learned with the inaccuracy of the given feedback. The resulting new action estimates are immediately applied as probabilistic conditional exploration. The first algorithm, PMP, is a surprisingly clean and straightforward way to accelerate an off-policy RL baseline, and it also improves on existing work that learns from corrections only. Its extension, PPMP, additionally predicts the corrected samples. This gives the most substantial improvements, whilst the required feedback is further reduced.

We demonstrate our algorithms in combination with Deep Deterministic Policy Gradient (DDPG) on continuous control problems from the OpenAI Gym. We show that the greater part of the otherwise ignorant learning process is indeed evaded. Moreover, we achieve drastic improvements in final performance, robustness to erroneous feedback, and feedback efficiency, both for simulated and real human feedback, and show that our method is able to outperform the demonstrator.
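To make the merging idea concrete, the sketch below illustrates one plausible reading of the probabilistic combination described above: the spread over the actor's output heads serves as the policy's uncertainty, a binary correction is turned into a rough action estimate, and the two are fused with variance-based weights. It is a minimal illustration, not the thesis implementation; the function name, the fixed feedback variance, and the step size used to convert binary feedback into an action offset are assumptions introduced here for clarity.

```python
import numpy as np

def merge_policy_and_feedback(head_actions, feedback,
                              feedback_variance=0.1 ** 2, step_size=0.5):
    """Hypothetical sketch of merging a multi-head actor estimate with
    binary corrective feedback (not the authors' exact formulation).

    head_actions : (K, D) array, actions proposed by the K actor heads
    feedback     : (D,) array with entries in {-1, 0, +1}
    """
    # Policy estimate: mean and per-dimension variance over the K heads.
    policy_action = head_actions.mean(axis=0)
    policy_variance = head_actions.var(axis=0) + 1e-8

    # Turn the binary correction into a crude action estimate by stepping
    # from the policy mean in the indicated direction.
    corrected_action = policy_action + step_size * feedback

    # Variance-weighted fusion: an uncertain policy defers to the feedback,
    # while a confident policy largely ignores it.
    gain = policy_variance / (policy_variance + feedback_variance)
    merged = np.where(feedback == 0,
                      policy_action,
                      policy_action + gain * (corrected_action - policy_action))

    # The merged action would then be executed as conditional exploration.
    return np.clip(merged, -1.0, 1.0)
```

Under this reading, early in training the heads disagree, the gain is large, and the human correction dominates the executed action; as the heads converge, the policy's own estimate takes over and feedback is needed less and less.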