WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

Conference Paper (2021)

Author(s)

Qisong Yang (TU Delft - Algorithmics)

Thiago D. Simão (TU Delft - Algorithmics)

Simon H. Tindemans (TU Delft - Intelligent Electrical Power Grids)

M.T.J. Spaan (TU Delft - Algorithmics)

Research Group

Algorithmics

Copyright

Reinforcement Learning

To reference this document use:

https://resolver.tudelft.nl/uuid:8504e311-60f1-4fc7-96e1-921884e0900c

Publication Year

2021

Language

English

Pages (from-to)

10639-10646

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
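
To make the difference between an expectation-based and a tail-based constraint concrete, here is a minimal sketch in Python. It is not the paper's implementation: it estimates conditional Value-at-Risk empirically from sampled episode costs, and the names empirical_cvar and update_safety_weight, the step size, and the cost values are all hypothetical choices made for illustration. The weight update is a generic rule that raises the weight when the constraint is violated; it stands in for the adaptive safety weights mentioned above and may differ from the authors' update.

```python
import numpy as np


def empirical_cvar(costs, alpha):
    """Mean of the worst (highest) alpha-fraction of the sampled costs.

    alpha = 1.0 reduces to the plain expectation; smaller alpha looks
    further into the tail. Illustrative estimator only.
    """
    ordered = np.sort(np.asarray(costs, dtype=float))[::-1]  # highest first
    k = max(1, int(np.ceil(alpha * ordered.size)))           # tail sample count
    return float(ordered[:k].mean())


def update_safety_weight(weight, costs, alpha, cost_limit, step_size=0.1):
    """Hypothetical adaptive rule: raise the safety weight when the CVaR
    exceeds the limit, lower it otherwise, and never go below zero."""
    violation = empirical_cvar(costs, alpha) - cost_limit
    return max(0.0, weight + step_size * violation)


# Made-up episode costs: mostly safe, with two rare but severe outcomes.
episode_costs = [0, 0, 0, 0, 0, 0, 0, 0, 50, 50]
cost_limit = 25.0

expected_cost = empirical_cvar(episode_costs, alpha=1.0)  # 10.0, within the limit
tail_cost = empirical_cvar(episode_costs, alpha=0.2)      # 50.0, violates the limit

# The same data leads to opposite adjustments of the safety weight.
weight_expectation = update_safety_weight(1.0, episode_costs, 1.0, cost_limit)
weight_worst_case = update_safety_weight(1.0, episode_costs, 0.2, cost_limit)
```

With these made-up numbers the expected cost satisfies the limit while the cost in the worst 20% of episodes does not, so the expectation-based rule relaxes the safety weight and the tail-based rule tightens it.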

Files

17272_Article_Text_20766_1_2_2... (pdf)

(pdf | 3.4 Mb)

- Embargo expired on 15-11-2021

License info not available