Policy Learning with Human Teachers

Using directive feedback in a Gaussian framework


Abstract

A prevalent approach to learning a control policy in the model-free domain is Reinforcement Learning (RL). A well-known disadvantage of RL is that it requires extensive amounts of data to obtain a suitable control policy. For systems with a physical application, acquiring this vast amount of data may take an impractically long time. In contrast, humans have been shown to be very efficient at finding suitable control policies for reference-tracking problems. Exploiting this intuitive knowledge has proven to make model-free learning strategies feasible for physical applications. Recent studies have shown that learning a policy from directive action corrections is a very efficient way of employing this domain knowledge. Moreover, feedback-based methods do not necessarily require expert knowledge of modelling and control and are therefore more generally applicable. The current state of the art in directive feedback was introduced by Celemin and Ruiz-del Solar (2015) and coined COrrective Advice Communicated by Humans (COACH). In this framework the trainer corrects the observed actions by providing directive advice for iterative policy updates. However, COACH employs Radial Basis Function (RBF) networks, which limit the applicability of the framework to higher-dimensional problems due to an infeasible tuning process. This study introduces Gaussian Process Coach (GPC), an algorithm that preserves COACH's structure but introduces Gaussian Processes (GPs) as an alternative to RBF networks. Moreover, the use of GPs allows for uncertainty estimation of the policy, which is used to 1) query highly informative feedback samples in an Active Learning (AL) framework, 2) introduce an Adaptive Learning Rate (ALR) that adapts the learning rate to the coarse- or refinement-focused learning phase of the trainer, and 3) establish a novel sparsification technique specifically designed for iterative GP policy updates. Using both synthesized and human teachers, we show that the novel algorithm outperforms COACH in every domain tested, with the most pronounced difference on higher-dimensional problems. Furthermore, we demonstrate the independent contributions of AL and ALR.
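To make the core mechanism concrete, the sketch below illustrates the general idea described above, not the thesis's actual implementation: a GP maps states to actions, the human supplies directive feedback h in {-1, 0, +1}, and the GP's predictive uncertainty scales the size of each correction (the Adaptive Learning Rate). The scikit-learn GP, the constant E_BASE, and the function policy_step are illustrative assumptions; the exact update rule and feedback protocol in the thesis may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

E_BASE = 0.1  # hypothetical nominal correction magnitude

kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
policy = GaussianProcessRegressor(kernel=kernel)

X, y = [], []  # stored (state, corrected action) pairs


def policy_step(state, human_feedback):
    """One GPC-style update from a single directive correction."""
    s = np.atleast_2d(state)
    if X:  # posterior prediction once feedback data exists
        action, std = policy.predict(s, return_std=True)
    else:  # GP prior before any feedback has been given
        action, std = np.zeros(1), np.ones(1)
    # Adaptive learning rate: large corrections where the policy is
    # uncertain (coarse learning), small ones where it is confident.
    corrected = action[0] + E_BASE * std[0] * human_feedback
    X.append(s.ravel())
    y.append(corrected)
    policy.fit(np.array(X), np.array(y))  # iterative policy update
    return corrected


# Example usage: the trainer advises "increase the action" in this state.
policy_step(np.array([0.3, -0.1]), human_feedback=+1)
```

The same predictive standard deviation that scales the correction could also drive the Active Learning component, by querying the trainer in states where the policy is most uncertain.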