Safe Reinforcement Learning by Shielding for Autonomous Vehicles

Master Thesis (2021)
Author(s)

J. Zoon (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Matthijs T.J. Spaan – Mentor (TU Delft - Algorithmics)

Erwin Walraven – Graduation committee member (TNO)

F.A. Oliehoek – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2021 Job Zoon
Publication Year
2021
Language
English
Graduation Date
28-06-2021
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In the past few years, there has been much research in the field of Autonomous Vehicles (AVs). Integrating AVs into daily life could bring many advantages, but before this can happen, safe driver models that control the AVs need to be designed. Reinforcement Learning (RL) is a suitable technique for creating such models. A problem, however, is that an RL agent usually needs to execute random actions during training, which is unsafe when driving an AV. Two shields are proposed to solve this problem: a Safety Checking Shield (SCS) and a Safe Initial Policy Shield (SIPS). The SCS checks whether an action is safe by predicting the state that would result from taking it and verifying that this future state is safe. The SIPS checks whether an action is safe by comparing it to a safe action from a Safe Initial Policy; based on the safety of the current state and this safe action, a range of safe actions is constructed within which the chosen action must fall. Furthermore, two shield-based learning techniques are proposed, integrated into the RL algorithm, that allow the agent to learn to avoid proposing actions that would be overruled by a shield: the first fabricates additional experiences, and the second adopts an alternative loss function. Two scenarios were created in the CARLA driving simulator to test the systems: in the first, the agent must learn to drive straight, and in the second, it must learn to avoid hitting other vehicles on a straight road. The two shields are built around a Double Deep Q-Network (DDQN) and compared against it. Both shielding systems achieve zero collisions during training and execution, while matching or exceeding the efficiency of the baseline DDQN. Furthermore, both shield-based learning techniques are shown to effectively teach the agent not to propose unsafe actions.
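To make the shielding idea concrete, the sketch below illustrates how the two shields described above could be placed between an RL agent and the environment. This is a minimal, schematic Python example, not the thesis implementation: all names (predict_next_state, is_state_safe, safe_policy, max_deviation, fallback_action) are illustrative assumptions, and the actual system is built around a DDQN agent in CARLA.

```python
# Minimal sketch of the two shield types, under assumed interfaces.
# None of these names come from the thesis; they only illustrate the idea.

class SafetyCheckingShield:
    """Accepts an action only if the predicted successor state is safe (SCS idea)."""

    def __init__(self, predict_next_state, is_state_safe, fallback_action):
        self.predict_next_state = predict_next_state  # assumed one-step dynamics model
        self.is_state_safe = is_state_safe            # assumed safety predicate on states
        self.fallback_action = fallback_action        # assumed safe action used when overruling

    def filter(self, state, proposed_action):
        next_state = self.predict_next_state(state, proposed_action)
        if self.is_state_safe(next_state):
            return proposed_action
        return self.fallback_action(state)


class SafeInitialPolicyShield:
    """Keeps the agent's action within a range around a known-safe policy (SIPS idea)."""

    def __init__(self, safe_policy, max_deviation):
        self.safe_policy = safe_policy        # assumed hand-crafted or pre-verified policy
        self.max_deviation = max_deviation    # assumed width of the safe action range

    def filter(self, state, proposed_action):
        safe_action = self.safe_policy(state)
        low, high = safe_action - self.max_deviation, safe_action + self.max_deviation
        # Clip the proposed (scalar) action into the safe range.
        return min(max(proposed_action, low), high)


# Schematic use inside a training step, including the fabricated-experience idea:
#   action = agent.select_action(state)
#   safe_action = shield.filter(state, action)
#   next_state, reward, done = env.step(safe_action)
#   if safe_action != action:
#       # Store the overruled action with a penalty so the agent learns
#       # not to propose it again (one of the shield-based learning techniques).
#       agent.store_experience(state, action, penalty_reward, next_state)
```

In this sketch, either shield exposes the same filter interface, so the same training loop can be used with the SCS, the SIPS, or no shield (the baseline DDQN).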
