Deep Reinforcement Learning - Pretraining actor-critic networks using state representation learning
Abstract
In control, the objective is to find a mapping from states to actions that steer a system to a desired reference. A controller can be designed by an engineer, typically using some model of the system, or it can be learned by an algorithm. Reinforcement Learning (RL) is one such algorithm. In RL, the controller is an agent that interacts with the system with the aim of maximizing the rewards received over time. In recent years, Deep Neural Networks (DNNs) have been successfully used as function approximators in RL algorithms. One particular algorithm that is used to learn various continuous control tasks is the Deep Deterministic Policy Gradient (DDPG). The DDPG learns two DNNs: an actor network that maps states to actions and a critic network that is used to find the policy gradient. The policy gradient is subsequently used to update the actor in the direction that maximizes the rewards over time. A disadvantage of using a DNN as a function approximator is the amount of data that is necessary to train such a network, data that is not always available or can be expensive to obtain. An advantage of DNNs is that they can cope with high-dimensional state and action spaces, something other (local) function approximators are less suitable for. State Representation Learning (SRL) is a technique that is typically used to lower the dimensionality of the state space. Instead of learning from the raw observations of the system, SRL is used to map a high-dimensional observation to a low-dimensional state before learning the RL task. The main idea is that the learning algorithm first learns how to extract the relevant information from the system before it learns to control it.
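As a rough illustration of the actor update described above, the following minimal sketch replaces both DNNs with toy stand-ins: a one-parameter linear actor and a hand-written critic. The function names, the gain `k`, the learning rate, and the batch size are illustrative assumptions and do not come from the thesis; in the DDPG the critic is itself learned from data rather than given.

```python
import random


def actor(theta, s):
    """Toy deterministic policy: action = theta * state."""
    return theta * s


def critic_grad_a(s, a, k=-1.5):
    """dQ/da for an assumed stand-in critic Q(s, a) = -(a - k*s)**2."""
    return -2.0 * (a - k * s)


def actor_update(theta, states, lr=0.1):
    """One deterministic policy gradient step on the actor parameter."""
    grad = 0.0
    for s in states:
        a = actor(theta, s)
        # Chain rule: dQ/dtheta = dQ/da * da/dtheta, with da/dtheta = s.
        grad += critic_grad_a(s, a) * s
    # Gradient ascent on the critic's value, averaged over the batch.
    return theta + lr * grad / len(states)


random.seed(0)
theta = 0.0
for _ in range(200):
    batch = [random.uniform(-1.0, 1.0) for _ in range(32)]
    theta = actor_update(theta, batch)
```

After the loop, `theta` approaches the gain `k` that the stand-in critic prefers, which is the sense in which the critic's gradient steers the actor.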
In this thesis, two algorithms are designed: the Robotic Prior Deep Deterministic Policy Gradient (RP-DDPG) and the Model Learning Deep Deterministic Policy Gradient (ML-DDPG). Both combine SRL with the DDPG algorithm, with the aim of improving the data efficiency and/or performance of the original algorithm. The two algorithms differ in the type of SRL method that is used. The RP-DDPG uses a concept known as the Robotic Priors, which describes a desirable structure of the state such that it is consistent with physics. The ML-DDPG learns a model of the system.
The algorithms are compared on three different benchmark problems. For each benchmark, various observation vectors are "designed" to simulate different ways in which information about the state of the system is communicated to the agent. In our experiments, the RP-DDPG is unable to learn two out of three benchmark problems and requires four times more data on the third problem. The ML-DDPG is more successful: it outperforms the original DDPG on one benchmark and performs similarly on the other two.
In general, the DDPG and the ML-DDPG do not learn state-of-the-art policies. When the task is to track a certain reference signal, the controlled system has a steady-state error and/or significant overshoot for some of the reference positions. They do, however, learn reasonable policies under very difficult circumstances. They can learn (to some degree) to ignore irrelevant inputs, deal with a reference position that is given in a different coordinate system than the configuration of the agent's body, and learn from high-dimensional observations. Most importantly, they do so without the need to specifically design or alter the algorithm to deal with these challenges.
In a final experiment, the ability of a DNN to generalize what it has learned to examples that differ substantially from the examples it has seen during training is investigated. This ability is what sets a DNN apart from other function approximators and is believed to be the reason why DNNs can cope with high-dimensional observations. The experiment shows that the actor can generalize its policy, i.e., it can produce a control action for observations that differ substantially from the observations it has seen during training. Furthermore, the experiment supports the claim that a DNN is able to generalize by learning individual factors that each contribute to the control action independently.
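To make the model-learning flavour of SRL more concrete, the sketch below scores an encoder by how well the low-dimensional state it produces, together with the action, predicts what happens next. Everything in it is an assumption made for illustration: the linear encoder `W`, the linear predictive head `M`, the choice of next observation and reward as prediction targets, and the toy data. It is not the network architecture or the loss used in the thesis.

```python
def encode(W, obs):
    """Linear encoder: map an observation to a lower-dimensional state."""
    return [sum(w * o for w, o in zip(row, obs)) for row in W]


def predict(M, state, action):
    """Linear predictive head over the features [state..., action]."""
    features = state + [action]
    return [sum(m * f for m, f in zip(row, features)) for row in M]


def model_loss(W, M, transitions):
    """Mean squared error of predicting (next observation, reward)."""
    total = 0.0
    for obs, action, next_obs, reward in transitions:
        target = list(next_obs) + [reward]
        pred = predict(M, encode(W, obs), action)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(transitions)


# Toy system (assumed): position x, next position x + a, reward -(x + a).
# The observation [x, 2x] is a redundant two-dimensional view of x.
transitions = [
    ([1.0, 2.0], 0.5, [1.5, 3.0], -1.5),
    ([-2.0, -4.0], 1.0, [-1.0, -2.0], 1.0),
]
M = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
W_good = [[1.0, 0.0]]  # keeps x, drops the redundant copy
W_bad = [[0.0, 0.0]]   # discards all information

loss_good = model_loss(W_good, M, transitions)
loss_bad = model_loss(W_bad, M, transitions)
```

On this toy data, `loss_good` is zero while `loss_bad` is not, so minimizing such a loss favours encoders that retain the information needed to predict the system's behaviour.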