Safe Reinforcement Learning Applications

Master thesis (2019)

Authors

T.M. Monteiro Nunes (Aerospace Engineering)

Contributors

E. van Kampen (supervisor 1)

Faculty

Aerospace Engineering

To reference this document use:

http://resolver.tudelft.nl/uuid:c7e407a6-0a2a-4828-b5d5-8d3fab78ea96

Published Date

11-12-2019

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Reinforcement Learning (RL) focuses on maximizing the return (the discounted rewards) over each episode. One of its main challenges is that it is ill-suited to safety-critical tasks, because the agent may transition into critical states while exploring. Safe Reinforcement Learning (SafeRL) is a subset of RL that focuses on achieving safe exploration during the learning process, thus allowing RL to be used in safety-critical tasks. This research expands existing SafeRL algorithms by combining two previously developed safety metrics, the proximity metric and the operative metric, into a novel one (ProxOp). It also replaces the interval analysis bounding model with an ellipsoid-based bounding model. To validate the combined metric, two examples are used to test the algorithms and compare their relative survivability and computational efficiency: a quadrotor navigation task and an elevator control task. Results show that ProxOp outperforms one of the original metrics in the quadrotor navigation task and the other in the elevator control task. Overall, the combined metric is the better option when there is no a priori knowledge of how either original metric will behave. The ellipsoidal bounding model is tested on the elevator control task. It shows comparable performance when combined with the proximity and ProxOp metrics, but a significant degradation in performance when used solely with the operative metric. The ellipsoidal bounding model is also preferable for tasks where the bounds of the system model are unknown, since it initially estimates the model error using Gaussian processes. This research thus presents a viable novel safety metric, as well as an alternative bounding model that can be used with it, for RL applications where safety is important.

Files

Tiago_Nunes_4757122_Final_Thes... (.pdf)

(.pdf | 1.31 Mb)