Preference-driven demonstrations ranking for inverse reinforcement learning

Abstract

New, flexible teaching methods for robotics are needed to automate repetitive tasks that are currently still performed by humans. For limited batch sizes, it is too expensive to teach a robot a new task (Smith & Anderson, 2014). Ideally, such flexible robots can be taught a new task by a non-expert: a person who knows the task the robot should perform but has no experience in programming a robot. A powerful method that would allow for flexible robotics without the use of an expert is inverse reinforcement learning (IRL). IRL aims to learn a cost function from demonstrations; this cost function is subsequently used to learn a policy that realizes the desired task. Current implementations focus mostly on the IRL algorithm itself and assume that enough demonstrations are available and that their quality is close enough to the optimal behaviour (Doerr et al., 2015). In practice, however, these demonstrations are expensive to obtain and non-optimal. This thesis focuses on the effect of the quality of the input demonstrations on the performance of the learned trajectory, and on how imperfect demonstrations can still be used without lowering the performance of the learned trajectory. The first hypothesis is that the performance of the resulting trajectory depends on the average performance of the input demonstrations, whereas the quantity of the input demonstrations has less of an effect. The second hypothesis is that by adding a ranking to the demonstrations, created from the preferences of non-robotic experts, the performance of the learned trajectory becomes better than the average performance of the input demonstrations. The preferences of the non-robotic experts are collected through a crowdsourcing experiment and are used to create an overall performance measure. This overall performance measure is used both to order the input demonstrations and to evaluate the final learned trajectories. The results validate the first hypothesis: the average performance of the input demonstrations determines the performance of the learned trajectory. The second hypothesis could not be confirmed; the results did not show any improvement in the performance of the learned trajectory when the ranking based on the preferences of a non-robotic expert was added. It could be argued that the input demonstrations were too similar, or that the cost features used in IRL are not specific enough to produce different cost functions and, therefore, differently performing trajectories.
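
The abstract does not specify how the crowdsourced pairwise preferences are aggregated into an overall performance measure. As a minimal illustrative sketch only, and not the method used in the thesis, the snippet below fits a Bradley-Terry model to hypothetical pairwise judgements between demonstrations and returns a ranking; all names (bradley_terry_ranking, the toy demonstration ids) are assumptions introduced for illustration.

```python
# Hypothetical sketch (not the thesis implementation): turning crowdsourced
# pairwise preferences between demonstrations into a single ranking by fitting
# a Bradley-Terry model with a minorization-maximization update.
from collections import defaultdict

def bradley_terry_ranking(pairs, demos, iters=100):
    """pairs: list of (winner, loser) demonstration ids from the crowd.
    Returns demonstration ids ordered from most to least preferred."""
    wins = defaultdict(int)       # comparisons won per demonstration
    matches = defaultdict(int)    # comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        matches[frozenset((winner, loser))] += 1

    strength = {d: 1.0 for d in demos}  # latent quality score per demonstration
    for _ in range(iters):
        new = {}
        for d in demos:
            denom = 0.0
            for other in demos:
                if other == d:
                    continue
                n = matches[frozenset((d, other))]
                if n:
                    denom += n / (strength[d] + strength[other])
            # standard MM update; keep the old score if d was never compared
            new[d] = wins[d] / denom if denom > 0 else strength[d]
        total = sum(new.values())
        strength = {d: s / total for d, s in new.items()}  # normalize scores

    return sorted(demos, key=lambda d: strength[d], reverse=True)

# Toy usage: "A" is preferred over "B" twice, "B" over "C", and "A" over "C".
prefs = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
print(bradley_terry_ranking(prefs, ["A", "B", "C"]))  # -> ['A', 'B', 'C']
```

The resulting scores could serve both purposes mentioned above: ordering the input demonstrations before they are passed to IRL and evaluating the final learned trajectories on a common scale.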