Guaranteed globally optimal continuous reinforcement learning

Master thesis (2012)

Authors

H.J. Bijl

Contributors

E. Van Kampen (mentor)

Department

Control & Simulation (Aerospace Engineering) (TU Delft)

Reinforcement learning Interval analysis Self-learning controllers Proven convergence Interval Q-learning

More Info

expand_more

To reference this document use:

http://resolver.tudelft.nl/uuid:dbe531db-f69d-40dc-8dc7-c0fd20cf2e01

Published Date

02-07-2012

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Aerospace Engineering

Department

Control & Simulation

Abstract

Self-learning controllers offer various strong benefits over conventional controllers, the most important one being their ability to adapt to unexpected circumstances. Their application is however limited, the most important reason being that, for self-learning controllers to work on continuous domains, nonlinear function approximators are required, and as soon as nonlinear function approximators are involved, it is uncertain whether convergence will occur. This project has as goal to contribute towards achieving convergence guarantees. The first focus lies on using reinforcement learning, combined with neural network function approximation, to create self-learning controllers. A reinforcement learning controller architecture has been set up which is capable of controlling systems with continuous states and actions. Also an extension has been made that enables the controller to freely vary its timestep without any significant consequences. A literature research has shown that there are no convergence proofs yet of practically feasible reinforcement learning controllers with nonlinear function approximators. Several proofs of convergence for the case of linear function approximators have been provided. Also, a proof exists that a certain reinforcement learning controller algorithm with nonlinear function approximation converges. However, the corresponding learning algorithm requires an infinite amount of iterations for the value function to converge even before the policy is updated, making it practically unfeasible. The reasons why convergence often does not occur for reinforcement learning controllers with neural network function approximators include overestimated value functions, incorrect generalization, function approximators incapable of approximating the actual value function and error amplification due to bootstrapping. Furthermore, when convergence does occur, it’s not necessarily to a global optimum. Ideas to solve these issues have been offered, but it is still far from certain whether these ideas will work. Furthermore, they will make the algorithm very complex. Proving that the resulting algorithm will converge, even if it’s only convergence to a local optimum, will prove to be extremely difficult, if it is at all possible. Hence reinforcement learning controllers with neural network function approximators do not seem to be appropriate if convergence to the global optimum needs to be ensured. The literature research has also revealed that so far no attempt has been made in combining reinforcement learning with interval analysis. To investigate the possibilities of combining these two fields, the interval Q-learning algorithm has been designed. This algorithm combines a discrete version of Q-learning with interval analysis techniques. This algorithm has proven convergence for the discrete case. Subsequently, the discrete interval Q-learning algorithm has been expanded to the continuous domain. This was done using the main RL value assumption, which assumes that the derivatives of the value function with respect to every state parameter and action parameter have known bounds. The resulting continuous interval Q-learning algorithm was shown to have proven convergence to the optimal value function. Furthermore, bounds on how fast the algorithm converges were given. The most important downside of the first version of the continuous interval Q-learning algorithm was its slow run-time. This made the algorithm practically infeasible. To still meet the goals of the project, a different function approximator was designed. This function approximator used many small blocks to bound the optimal value function in every part of the state-action-space. Furthermore, the number of blocks was increased dynamically as the algorithm learned, thus giving the function approximator a theoretically unlimited accuracy. Though the resulting algorithm used information slightly less efficiently than its precursor, its run-time was significantly improved. It became practically feasible to apply this algorithm. In the end of the report, the algorithm has been applied to a few simple test problems. An important parameter here was the dimension D of the problem, which equals the sum of the number of state and action parameters. For two- and three-dimensional problems, the algorithm was able to sufficiently bound the optimal value function quite quickly, resulting in a controller with satisfactory performance. For the cart and pendulum system, which is a five-dimensional problem, this turned out to be different. A long training (in the order of several hours or more) will be required before a satisfactory performance can be obtained. However, since the algorithm has proven convergence to the globally optimal policy, this does not necessarily have to be a big problem. Finally, it is mentioned that this thesis report has introduced the world’s first combination of reinforcement learning and interval analysis. It also introduced the world’s first practically feasible continuous RL controller with proven convergence to the global optimum. The key to accomplishing such a controller turned out to be (A) letting go of conventional ways of designing continuous RL controllers, and (B) quantifying the assumption that the value Q of two nearby states are similar through the main RL value assumption.

Files

ThesisReportHildoBijl.pdf

(pdf | 4.74 Mb)