AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training

Conference Paper (2021)
Author(s)

Thiago D. Simão (TU Delft - Algorithmics)

Nils Jansen (Radboud Universiteit Nijmegen)

M.T.J. Spaan (TU Delft - Algorithmics)

Research Group
Algorithmics
Copyright
© 2021 T. D. Simão, Nils Jansen, M.T.J. Spaan
Publication Year
2021
Language
English
Pages (from-to)
1226-1235
ISBN (electronic)
9781450383073
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describes the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is, without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if the safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent's performance.
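As background for the abstract's reference to decoupling reward and safety via CMDPs, the following is a minimal sketch of the standard constrained objective, written in generic notation (discount factor \gamma, reward R, cost signal C, and cost budget \hat{c}) that may differ from the paper's own formulation:

\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\Big] \le \hat{c}

Here the reward R drives performance while the independent cost signal C captures the safety aspects, so the agent can trade off the two without a hand-engineered combined reward.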

Files

P1226.pdf
(pdf | 2.68 MB)
License info not available