AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training

Conference Paper (2021)
Author(s)

Thiago D. Simão (TU Delft - Algorithmics)

Nils Jansen (Radboud Universiteit Nijmegen)

M.T.J. Spaan (TU Delft - Algorithmics)

Research Group
Algorithmics
Copyright
© 2021 T. D. Simão, Nils Jansen, M.T.J. Spaan
Publication Year
2021
Language
English
Pages (from-to)
1226-1235
ISBN (electronic)
9781450383073
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describes the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is, without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if the safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent's performance.
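As background for the abstract's reference to decoupling reward and safety via CMDPs, the following is a minimal sketch of the standard constrained objective, written in generic notation (discount factor \gamma, reward R, cost signal C, and cost budget \hat{c}) that may differ from the paper's own formulation:

\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\Big]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} C(s_t, a_t)\Big] \le \hat{c}

Here the reward R drives performance while the independent cost signal C captures the safety aspects, so the agent can trade off the two without a hand-engineered combined reward.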

Files

P1226.pdf
(pdf | 2.68 MB)
License info not available