Surrogate Reloaded: Fast Testing for Deep Reinforcement Learning
Convolutional Neural Networks as a surrogate model for DRL testing
L.M. Braszczyński (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Software Engineering)
A.J. Bartlett – Mentor (TU Delft - Multimedia Computing)
Abstract
In recent years, Deep Reinforcement Learning (DRL) has moved beyond playing games to more practical tasks such as autonomous parking. This transition has created a need for efficient testing of DRL agents. Evaluating an agent requires running a simulation of the task and letting the agent decide which actions to perform. These simulations are expensive and time-consuming: testing requires hundreds of different scenarios, and each simulation takes from seconds to minutes depending on the use case, so choosing the right initial state for each simulation is critical. Current solutions leverage surrogate models that approximate how difficult a given environment will be for an agent to complete, without running the tests. Existing work has explored surrogate models for DRL tasks, using a Multi-Layer Perceptron (MLP) as a surrogate for a DRL agent attempting to park a car in a parking lot. Parking scenarios are inherently spatial problems, and MLPs cannot exploit that spatial structure. To address this limitation, we use Convolutional Neural Networks (CNNs), which are designed to handle spatial information more effectively and should therefore improve prediction performance over MLPs. Our approach transforms the tabular scenario data into a low-resolution, grid-like image representing the parking lot. This surrogate guides a genetic algorithm to discover a mean of 25.76 failures per run, a 72% improvement over the 14.98 failures found by the MLP baseline. Our model also achieves significantly higher scores on diversity metrics for the generated environments.
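
The sketch below illustrates the two ideas the abstract describes: rasterising a tabular parking-lot scenario into a low-resolution grid "image", and querying a small CNN surrogate for a difficulty score instead of running the simulator. It is a minimal illustration in PyTorch, not the thesis implementation; the function and class names, grid size, channel layout, and layer widths are assumptions made for the example.

import numpy as np
import torch
import torch.nn as nn


def scenario_to_grid(obstacles, goal, grid_size=16, lot_size=40.0):
    """Rasterise one scenario (obstacle and goal positions in metres) into a
    2-channel grid: channel 0 marks occupied cells, channel 1 the target spot.
    (Illustrative encoding, not the exact representation used in the thesis.)"""
    grid = np.zeros((2, grid_size, grid_size), dtype=np.float32)
    scale = grid_size / lot_size
    for (x, y) in obstacles:
        i, j = int(y * scale), int(x * scale)
        grid[0, min(i, grid_size - 1), min(j, grid_size - 1)] = 1.0
    gi, gj = int(goal[1] * scale), int(goal[0] * scale)
    grid[1, min(gi, grid_size - 1), min(gj, grid_size - 1)] = 1.0
    return grid


class SurrogateCNN(nn.Module):
    """Predicts a scalar difficulty/failure score for one rasterised scenario."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 1),  # predicted difficulty of the scenario
        )

    def forward(self, x):
        return self.head(self.features(x))


# Example: score a candidate scenario without running the simulator, e.g. as
# the fitness query inside a genetic algorithm's evaluation step.
model = SurrogateCNN()
grid = scenario_to_grid(obstacles=[(5.0, 12.0), (20.0, 8.0)], goal=(30.0, 35.0))
score = model(torch.from_numpy(grid).unsqueeze(0))  # shape (1, 1)
print(score.item())

In this setup, the genetic algorithm would call the surrogate thousands of times cheaply and reserve the expensive simulator runs for the most promising candidate scenarios.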