Person | TU Delft Repository

Carlos Celemin

Authored

12 records found

Interactive Learning of Temporal Features for Control

Shaping Policies and State Representations From Human Feedback

Journal article - Rodrigo Pérez-Dattari, Carlos Celemin, G. Franzese, Javier Ruiz-del-Solar, J. Kober

The ongoing industrial revolution demands more flexible products, including robots in household environments and medium-scale factories. Such robots should be able to adapt to new conditions and environments and be programmed with ease. As an example, let us suppose that there ...

Continuous control for high-dimensional state spaces

An interactive learning approach

Conference paper - Rodrigo Pérez-Dattari, Carlos Celemin, Javier Ruiz-del-Solar, J. Kober

Deep Reinforcement Learning (DRL) has become a powerful methodology to solve complex decision-making problems. However, DRL has several limitations when used in real-world problems (e.g., robotics applications). For instance, long training times are required and cannot be acceler ...

Knowledge- and ambiguity-aware robot learning from corrective and evaluative feedback

Journal article - Carlos Celemin, J. Kober

In order to deploy robots that could be adapted by non-expert users, interactive imitation learning (IIL) methods must be flexible regarding the interaction preferences of the teacher and avoid assumptions of perfect teachers (oracles), while considering they make mistakes influe ...

A fast hybrid reinforcement learning framework with human corrective feedback

Journal article - Carlos Celemin, Javier Ruiz-del-Solar, J. Kober

Reinforcement Learning agents can be supported by feedback from human teachers in the learning loop that guides the learning process. In this work we propose two hybrid strategies of Policy Search Reinforcement Learning and Interactive Machine Learning that benefit from both sour ...

Reinforcement learning of motor skills using Policy Search and human corrective advice

Journal article - Carlos Celemin, Guilherme Maeda, Javier Ruiz-del-Solar, Jan Peters, J. Kober

Robot learning problems are limited by physical constraints, which make learning successful policies for complex motor skills on real systems unfeasible. Some reinforcement learning methods, like Policy Search, offer stable convergence toward locally optimal solutions, whereas in ...

Interactive Learning with Corrective Feedback for Policies Based on Deep Neural Networks

Conference paper - Rodrigo Pérez-Dattari, Carlos Celemin, Javier Ruiz-del-Solar, J. Kober

Deep Reinforcement Learning (DRL) has become a powerful strategy to solve complex decision making problems based on Deep Neural Networks (DNNs). However, it is highly data demanding, so unfeasible in physical systems for most applications. In this work, we approach an alternative ...

Learning Interactively to Resolve Ambiguity in Reference Frame Selection

Journal article - G. Franzese, Carlos Celemin, J. Kober

In Learning from Demonstrations, ambiguities can lead to bad generalization of the learned policy. This paper proposes a framework called Learning Interactively to Resolve Ambiguity (LIRA), that recognizes ambiguous situations, in which more than one action have similar probabili ...

Interactive Imitation Learning in State-Space

Journal article - Snehal Jauhri, Carlos Celemin, J. Kober

Imitation Learning techniques enable programming the behavior of agents through demonstrations rather than manual engineering. However, they are limited by the quality of available demonstration data. Interactive Imitation Learning techniques can improve the efficacy of learning ...

Interactive Imitation Learning in State-Space

Journal article - Snehal Jauhri, Carlos Celemin, J. Kober

Simultaneous learning of objective function and policy from interactive teaching with corrective feedback

Conference paper - Carlos Celemin, J. Kober

Some imitation learning approaches rely on Inverse Reinforcement Learning (IRL) methods, to decode and generalize implicit goals given by expert demonstrations. The study of IRL normally has the assumption of available expert demonstrations, which is not always possible. There ar ...

Deep Reinforcement Learning with Feedback-based Exploration

Conference paper - Jan Scholten, Daan Wout, Carlos Celemin, J. Kober

Deep Reinforcement Learning has enabled the control of increasingly complex and high-dimensional problems. However, the need for vast amounts of data before reasonable performance is attained prevents its widespread application. We employ binary corrective feedback as a general an ...

Uncertainties based queries for Interactive policy learning with evaluations and corrections

Conference paper - Carlos Celemin, J. Kober

Contributed

7 records found

Interactive Learning in State-space

Enabling robots to learn from non-expert humans

Master thesis - S. Jauhri, J. Kober, Carlos Celemin, A.J. van Genderen, L. Peternel

Imitation Learning is a technique that enables programming the behavior of agents through demonstration, as opposed to manually engineering behavior. However, Imitation Learning methods require demonstration data (in the form of state-action labels) and in many scenarios, the dem ...

Towards Corrective Deep Imitation Learning in Data Intensive Environments

Helping robots to learn faster by leveraging human knowledge

Master thesis - I. Lopez Bosque, Carlos Celemin, Rodrigo Pérez-Dattari, J. Kober, W. Pan

Interactive imitation learning refers to learning methods in which a human teacher interacts with an agent during the learning process, providing feedback to improve its behaviour. This type of learning may be preferable to reinforcement learning techniques when dealing with real-world problems. This is especially true in robotic applications, where reinforcement learning may be unfeasible because training times are long and reward functions can be hard to shape or compute. The present thesis focuses on interactive learning with corrective feedback and, in particular, on the framework Deep Corrective Advice Communicated by Humans (D-COACH), which has been shown to be advantageous in terms of training time and data efficiency. D-COACH, a supervised learning method whose policy is represented by an artificial neural network, incorporates a replay buffer where samples of states and corresponding labels gathered by the agent's policy from human feedback are stored and replayed. However, this causes conflicts between the data in the buffer, because samples collected by older versions of the policy may be contradictory and could deteriorate the performance of the current policy. To reduce this issue, the current implementation of D-COACH uses a first-in-first-out buffer of limited size, since the older a sample is, the more likely it is to deteriorate the performance of the learner. Nonetheless, this limitation promotes catastrophic forgetting, an inherent tendency of neural networks to forget what they have already learnt, which can be mitigated by replaying information gathered during all stages of the problem. Therefore, D-COACH suffers from a trade-off between reducing conflicting data and avoiding catastrophic forgetting. Because D-COACH limits the size of its buffer, the types of problems it can solve are restricted: if the problem is too complex (i.e. it requires large amounts of data), it simply will not be able to remember everything. If we want to use a buffer to train data-intensive tasks with corrective feedback, a new method is needed to solve the problem of using information gathered by older versions of the policy. We propose an improved version of D-COACH, which we call Batch Deep COACH (BD-COACH, pronounced “be the coach”). BD-COACH incorporates a human model module that learns the feedback from the teacher and that can be employed to make corrections gathered by older versions of the policy still useful for batch updating the current version of the policy. To compare the performance of BD-COACH with that of D-COACH, three simulated experiments were carried out using the open-source Meta-World benchmark, which is based on MuJoCo and OpenAI Gym. Moreover, to validate the proposed method in a real setup, two planar manipulation tasks were solved using a seven-degree-of-freedom KUKA robot arm. Furthermore, we present a comparison of on-policy and off-policy methods in both reinforcement learning and imitation learning. We believe there is an interesting parallel between this classification and the problem of correctly implementing a replay buffer when learning from corrective feedback.
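As an illustration of the buffer trade-off described in this abstract, the following is a minimal sketch of a size-limited first-in-first-out buffer. It is not taken from the thesis or from any D-COACH implementation; the class name `FifoFeedbackBuffer`, its methods, and the capacity of 3 are hypothetical and chosen only to show how old samples are dropped once the buffer is full.

```python
from collections import deque


class FifoFeedbackBuffer:
    """Hypothetical fixed-size FIFO store of (state, label) samples."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self._samples = deque(maxlen=capacity)

    def add(self, state, label):
        self._samples.append((state, label))

    def samples(self):
        return list(self._samples)

    def __len__(self):
        return len(self._samples)


# Illustrative run: five corrections arrive, but only three fit.
buffer = FifoFeedbackBuffer(capacity=3)
for step in range(5):
    buffer.add(state=step, label=f"correction-{step}")

# The two oldest samples (states 0 and 1) have been forgotten.
remaining_states = [state for state, _ in buffer.samples()]
print(remaining_states)  # [2, 3, 4]
```

A small capacity keeps only recent, mutually consistent corrections, at the cost of discarding early experience; a large capacity does the opposite. This is the tension the abstract attributes to D-COACH.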

Policy Learning with Human Teachers

Using directive feedback in a Gaussian framework

Master thesis - D. Wout, J. Kober, Carlos Celemin, D. Gavrila

A prevalent approach for learning a control policy in the model-free domain is Reinforcement Learning (RL). A well-known disadvantage of RL is that it needs extensive amounts of data to obtain a suitable control policy. For physical systems, acquiring this vast amount of data might take an extraordinary amount of time. In contrast, humans have been shown to be very efficient at finding a suitable control policy for reference tracking problems. Employing this intuitive knowledge has proven to render model-free learning strategies suitable for physical applications. Recent studies have shown that learning a policy from directive action corrections is a very efficient way of employing this domain knowledge. Moreover, feedback-based methods do not necessarily require expert knowledge of modelling and control and are therefore more generally applicable. The current state of the art regarding directional feedback was introduced by Celemin and Ruiz-del-Solar (2015) and coined COrrective Advice Communicated by Humans (COACH). In this framework the trainer is able to correct the observed actions by providing directive advice for iterative policy updates. However, COACH employs Radial Basis Function (RBF) networks, which limit the applicability of the framework to higher-dimensional problems due to an infeasible tuning process. This study introduces Gaussian Process Coach (GPC), an algorithm that preserves COACH’s structure but introduces Gaussian Processes (GPs) as an alternative to RBF networks. Moreover, the use of GPs allows for uncertainty estimation of the policy, which is used to 1) request highly informative feedback samples in an Active Learning (AL) framework, 2) introduce an Adaptive Learning Rate (ALR) that adapts the learning rate to the coarse or refinement-focused learning phase of the trainer, and 3) establish a novel sparsification technique that is specifically designed for iterative GP policy updates.
We will show, using synthesized and human teachers, that the novel algorithm outperforms COACH on every domain tested, with the most pronounced difference on higher-dimensional problems. Furthermore, we will demonstrate the independent contributions of AL and ALR.

Interactive Imitation Learning for Force control

Position And Stiffness Teaching with Interactive Learning

Master thesis - N.P. Lander, Carlos Celemin, J. Kober, L. Peternel

To generalize the use of robotics, there are a few hurdles still to take. One of these hurdles is the programming of the robots. Most robots on the market today employ position control, with a set of controller parameters tuned by an expert. This programming is quite expensive, only suited for a single task in a single configuration, and not interaction-safe. This thesis tries to solve these problems by introducing Position And Stiffness Teaching with Interactive Learning (PASTIL) and History Aware PASTIL (HA-PASTIL), a novel interactive way of learning scalable variable impedance policies. The system is able to learn both positional reference trajectories and stiffness trajectories at the same time. PASTIL and HA-PASTIL learn these policies from positional corrections applied by a human teacher through physical human-robot interaction (pHRI). For the measurement and extraction of these corrections only the proprioception sensors of the robot are used, so no force/torque sensors are required. To learn from these corrections, the intention of the teacher is estimated by segmenting the correction space into three parts. Each of these three parts corresponds to a set of update rules for the policy that fit the intention of a correction in that segment. In this thesis, the proposed algorithms are validated through a series of experiments with sample tasks and compared with baseline algorithms. The main conclusions from these tests are that PASTIL and HA-PASTIL, as introduced in this thesis, outperform the baseline algorithms on task performance for all tasks and that the learned stiffness makes a positive contribution to task performance. This means that the algorithms proposed here allow simple systems, with only proprioception sensors, to be instructed by users instead of experts. This makes it possible for robotics to be applied at lower cost, with less expertise needed to program and operate.
These algorithms, however, still have some aspects that could use further research. The most important example is that they have not yet been tested on an actual robot with physical human-robot interaction. There is still quite some work left to do, but the proposed algorithms might pave the way for more, and better, algorithms that aim to learn force control behaviour from only positional corrections.

Adaptation of a non-linear controller based on Reinforcement Learning

Master thesis - V. Khattar, R. Babuska, B. Shyrokau, Carlos Celemin

Closed-loop control systems, which utilize output signals for feedback to generate control inputs, can achieve high performance. However, robustness of feedback control loops can be lost if system changes and uncertainties are too large. Adaptive control combines the traditional ...

Learning Task Space Policies from Demonstration

Master thesis - L.K. Suresh Kumar, J. Kober, Carlos Celemin

In this thesis, we propose a method titled "Task Space Policy Learning (TaSPL)", a novel technique that learns a generalised task/state space policy, as opposed to learning a policy in state-action space, from interactive corrections in the observation space or from state only de ...

Deep Reinforcement Learning with Feedback-based Exploration

Master thesis - J.J. Scholten, J. Kober, Carlos Celemin, J.C.F. de Winter, S. Wahls

Deep Reinforcement Learning enables us to control increasingly complex and high-dimensional problems. Modelling and control design is no longer required, which paves the way to numerous innovations, such as optimal control of ever more sophisticated robotic systems, fast and efficient scheduling and logistics, effective personal drug dosing schemes that minimise complications, as well as applications not yet conceived. Yet, this potential is obstructed by the need for vast amounts of data. Without it, deep Reinforcement Learning (RL) cannot work. If we want to advance RL research and its applications, a primary concern is to improve this sample efficiency. Otherwise, all potential is restricted to settings where interaction is abundant, whilst this is seldom the case in real-world scenarios. In this thesis we will study binary corrective feedback as a general and intuitive manner to incorporate human intuition and domain knowledge in model-free machine learning. In accordance with our conclusions drawn from literature, we will present two algorithms, namely Probabilistic Merging of Policies (PMP) and its extension Predictive PMP (PPMP). Both methods estimate the abilities of their inbuilt RL entity by computing the covariance over multiple output heads of the actor network. Subsequently, the corrections are quantified by comparing the uncertainty in what is learned with the inaccuracy of the given feedback. The resulting new action estimates will immediately be applied as probabilistic conditional exploration. The first algorithm is a surprisingly clean and straightforward way to accelerate an off-policy RL baseline and also improves on existing work that learns from corrections only. Its extension, PPMP, predicts the corrected samples. This gives the most substantial improvements, whilst the required feedback is further reduced.
We demonstrate our algorithms in combination with Deep Deterministic Policy Gradient (DDPG) on continuous control problems of the OpenAI Gym. We show that the greatest part of the otherwise ignorant learning process is indeed evaded. Moreover, we achieve drastic improvements in final performance, robustness to erroneous feedback and feedback efficiency both for simulated and real human feedback, and show that our method is able to outperform the demonstrator.
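The idea of estimating the learner's ability from the disagreement between multiple output heads can be sketched as follows. This is a simplified illustration, not the PMP or PPMP algorithm: it assumes a one-dimensional action, for which the covariance reduces to a scalar variance, and the function name `head_disagreement` and the sample head outputs are hypothetical.

```python
from statistics import mean, pvariance


def head_disagreement(head_outputs):
    """Return (mean action, variance across heads) for a 1-D action.

    Assumption: each element is the action proposed by one output head
    for the same state. High variance is read as low confidence.
    """
    return mean(head_outputs), pvariance(head_outputs)


# Hypothetical head outputs for two different states.
confident = head_disagreement([0.50, 0.51, 0.49, 0.50])   # heads agree
uncertain = head_disagreement([0.10, 0.90, -0.40, 0.60])  # heads disagree

# The state with disagreeing heads gets the larger uncertainty estimate.
print(uncertain[1] > confident[1])  # True
```

In a scheme of this kind, a larger variance would suggest giving more weight to the human correction and a smaller variance more weight to the learned policy; the exact weighting used in the thesis is not stated in the abstract.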