# Learning Parameter Selection in Continuous Reinforcement Learning: Attempting to Reduce Tuning Efforts


Date: 2012-07-18

Abstract: The reinforcement learning (RL) framework makes it possible to construct controllers that seek an optimal control strategy in an unknown environment by trial and error. After selecting a control action, the controller receives a numerical reward based on the current state of the environment and the applied action. The controller aims to maximize the cumulative reward, known as the return. In this thesis, actor-critic and critic-only RL algorithms are considered. Actor-critic algorithms consist of an element that selects the actions (the actor) and an element that learns the expectation of the return (the critic). This expectation is captured in a value function, and the critic is used to improve the control policy of the actor. Critic-only algorithms select the action by direct optimization over a value function. Before an RL algorithm can be applied to a control problem, a number of learning parameters need to be set. The optimal values of some of these parameters are highly problem dependent; it is not straightforward how they should be chosen, and they are often determined by trying a large set of candidate values. The main focus of this thesis is to devise an action selection method that can select continuous actions without problem-dependent parameters. Two approaches are taken. First, it is investigated whether Levenberg-Marquardt (LM), a popular optimization method, can be used to determine the actor update step. Second, an action selection method without an explicit actor is treated, called Value-Gradient Based Policy (VGBP). The LM algorithm uses the gradient and the Hessian to compute the update step, so the policy gradient and Hessian need to be found. A novel actor-critic method has been devised, called Vanilla Actor-Critic (VAC), that efficiently learns the policy gradient. On the inverted pendulum swing-up task this algorithm outperformed Natural Actor-Critic (NAC).
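The actor-critic structure described above can be sketched in a few lines. This is a generic illustrative example, not the VAC algorithm from the thesis: the toy dynamics, reward, polynomial features, step sizes, and exploration noise level are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(s):
    # Simple polynomial features of a scalar state (illustrative choice).
    return np.array([1.0, s, s**2])

# Critic: linear value function V(s) = w . phi(s)
w = np.zeros(3)
# Actor: Gaussian policy with mean theta . phi(s) and fixed exploration std
theta = np.zeros(3)
sigma = 0.5

gamma = 0.95                   # discount factor (assumed value)
alpha_c, alpha_a = 0.1, 0.01   # critic / actor step sizes (assumed values)

s = 0.0
for step in range(2000):
    phi = features(s)
    mu = theta @ phi
    a = mu + sigma * rng.standard_normal()   # explore around the policy mean

    # Toy dynamics and reward: drive the state to zero (assumed problem).
    s_next = 0.9 * s + 0.1 * a
    r = -(s_next**2) - 0.01 * a**2

    # TD error delta = r + gamma*V(s') - V(s): the critic's learning signal,
    # which also tells the actor whether the action was better than expected.
    delta = r + gamma * (w @ features(s_next)) - (w @ phi)

    # Critic update: move V(s) toward the bootstrapped target.
    w += alpha_c * delta * phi
    # Actor update: likelihood-ratio policy gradient scaled by the TD error.
    theta += alpha_a * delta * (a - mu) / sigma**2 * phi

    s = float(np.clip(s_next, -5.0, 5.0))    # keep the toy state bounded
```

The fixed step sizes `alpha_c` and `alpha_a` are exactly the kind of problem-dependent learning parameters whose tuning the thesis tries to avoid.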
A number of different approaches have been taken to approximate the policy Hessian, but none delivered a proper estimate; therefore, no LM-based actor update method was created. In VGBP, the action is found by optimizing the right-hand side of the Bellman equation. VGBP uses the provided reward function and a process model for this optimization; the process model is learned online using local linear regression (LLR). Due to this efficient use of information, VGBP learns quickly on the pendulum swing-up task and on a 2-DOF robotic arm.
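The core of VGBP, choosing the action by optimizing the right-hand side of the Bellman equation, can be illustrated with a minimal sketch. Here the process model, reward function, and value function are simple stand-ins and the maximization is a coarse grid search over candidate actions; the thesis instead learns the model online with LLR and exploits the value gradient, so every function and constant below is an illustrative assumption.

```python
import numpy as np

gamma = 0.95  # discount factor (assumed value)

def model(s, a):
    # Stand-in process model f(s, a); the thesis learns this online with LLR.
    return 0.9 * s + 0.1 * a

def reward(s, a):
    # Stand-in reward function rho(s, a), assumed to be provided.
    return -(s**2) - 0.01 * a**2

def value(s):
    # Stand-in learned value function V(s); here a fixed quadratic guess.
    return -(s**2)

def select_action(s, actions):
    # VGBP-style selection: a* = argmax_a [ rho(s, a) + gamma * V(f(s, a)) ],
    # i.e. maximize the right-hand side of the Bellman equation over actions.
    q = [reward(s, a) + gamma * value(model(s, a)) for a in actions]
    return actions[int(np.argmax(q))]

candidates = np.linspace(-2.0, 2.0, 41)  # coarse grid over continuous actions
a_star = select_action(1.0, candidates)
```

A grid search is used here only for clarity; because the value function's gradient is available, the maximization over a continuous action space can instead be done with a gradient-based optimizer, which is what motivates the method's name.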

To reference this document use: http://resolver.tudelft.nl/uuid:94b81bc2-aff6-457f-9b54-be5e005def38

Embargo date: 2014-07-18

Part of collection: Student theses

Document type: master thesis

Rights: (c) 2012 Van Rooijen, J.C.