Safe Reinforcement Learning by Shielding for Autonomous Vehicles

Master Thesis (2021)
Author(s)

J. Zoon (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Matthijs T.J. Spaan – Mentor (TU Delft - Algorithmics)

Erwin Walraven – Graduation committee member (TNO)

F.A. Oliehoek – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2021 Job Zoon
Publication Year
2021
Language
English
Graduation Date
28-06-2021
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In the past few years, there has been much research in the field of Autonomous Vehicles (AVs). Integrating AVs into daily life could bring many advantages, but before this can happen, safe driver models that control the AVs need to be designed. Reinforcement Learning (RL) is a suitable technique for creating such models. A problem, however, is that an RL agent usually needs to execute random actions during training, which is unsafe when driving an AV. Two shields are proposed to solve this problem: a Safety Checking Shield (SCS) and a Safe Initial Policy Shield (SIPS). The SCS checks whether an action is safe by predicting the state that would result from taking it and verifying that this future state is safe. The SIPS checks whether an action is safe by comparing it to a safe action from a Safe Initial Policy; based on the safety of the current state and this safe action, a range of safe actions is constructed within which the chosen action must fall. Furthermore, two shield-based learning techniques are proposed, integrated into the RL algorithm, that allow the agent to learn to avoid proposing actions that would be overruled by a shield: the first fabricates additional experiences, and the second adopts an alternative loss function. Two scenarios were created in the CARLA driving simulator to test the systems: in the first, the agent must learn to drive straight, and in the second, it must learn to avoid hitting other vehicles on a straight road. The two shields are built around a Double Deep Q-Network (DDQN) and compared against it. Both shielding systems achieve zero collisions during training and execution, while matching or exceeding the efficiency of the baseline DDQN. Furthermore, both shield-based learning techniques are shown to effectively teach the agent not to propose unsafe actions.
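To make the shielding idea concrete, the sketch below illustrates how the two shields described above could be placed between an RL agent and the environment. This is a minimal, schematic Python example, not the thesis implementation: all names (predict_next_state, is_state_safe, safe_policy, max_deviation, fallback_action) are illustrative assumptions, and the actual system is built around a DDQN agent in CARLA.

```python
# Minimal sketch of the two shield types, under assumed interfaces.
# None of these names come from the thesis; they only illustrate the idea.

class SafetyCheckingShield:
    """Accepts an action only if the predicted successor state is safe (SCS idea)."""

    def __init__(self, predict_next_state, is_state_safe, fallback_action):
        self.predict_next_state = predict_next_state  # assumed one-step dynamics model
        self.is_state_safe = is_state_safe            # assumed safety predicate on states
        self.fallback_action = fallback_action        # assumed safe action used when overruling

    def filter(self, state, proposed_action):
        next_state = self.predict_next_state(state, proposed_action)
        if self.is_state_safe(next_state):
            return proposed_action
        return self.fallback_action(state)


class SafeInitialPolicyShield:
    """Keeps the agent's action within a range around a known-safe policy (SIPS idea)."""

    def __init__(self, safe_policy, max_deviation):
        self.safe_policy = safe_policy        # assumed hand-crafted or pre-verified policy
        self.max_deviation = max_deviation    # assumed width of the safe action range

    def filter(self, state, proposed_action):
        safe_action = self.safe_policy(state)
        low, high = safe_action - self.max_deviation, safe_action + self.max_deviation
        # Clip the proposed (scalar) action into the safe range.
        return min(max(proposed_action, low), high)


# Schematic use inside a training step, including the fabricated-experience idea:
#   action = agent.select_action(state)
#   safe_action = shield.filter(state, action)
#   next_state, reward, done = env.step(safe_action)
#   if safe_action != action:
#       # Store the overruled action with a penalty so the agent learns
#       # not to propose it again (one of the shield-based learning techniques).
#       agent.store_experience(state, action, penalty_reward, next_state)
```

In this sketch, either shield exposes the same filter interface, so the same training loop can be used with the SCS, the SIPS, or no shield (the baseline DDQN).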
