Testing Deep Reinforcement Learning (DRL) agents is computationally expensive and inefficient, especially when trying to identify environment configurations where the agent fails to reach its objective. Recent work proposes the use of a Multi-Layer Perceptron (MLP) as a surrogate model to predict whether a given environment configuration is likely to be a failing environment without running the tests. However, this raises the question of whether other surrogate models can perform this task better and whether their training can be improved by fine-tuning their hyperparameters. In this work, we investigate how XGBoost performs as an alternative to the MLP for predicting DRL failures, combined with grid search as a hyperparameter optimization technique. We evaluated our approach on the Parking environment from the HighwayEnv simulator using a DRL agent trained with Hindsight Experience Replay (HER), where a failing environment is one in which the vehicle collides with a parked vehicle or a timeout occurs. The evaluation assesses how well the model classifies failing environments and how effectively it guides a Genetic Algorithm (GA) to find failing environments (failure search). We compared the performance of XGBoost with the MLP and found that XGBoost significantly outperforms the MLP baseline across key classification metrics such as F1-score and AUC-ROC. Furthermore, during failure search, the XGBoost-guided GA yields more failing environments with greater coverage and entropy, indicating increased diversity and effectiveness. These findings suggest that XGBoost is a strong candidate for surrogate modelling in DRL testing and offers a more reliable alternative to an MLP-based approach.
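To make the surrogate-modelling step concrete, the following is a minimal sketch (not the authors' code) of training an XGBoost classifier with grid search to predict failing environment configurations. The feature encoding, failure labels, and hyperparameter grid values are illustrative assumptions; only the overall pipeline (XGBoost plus grid-search tuning, evaluated with F1-score and AUC-ROC) follows the approach described above.

```python
# Sketch: surrogate model predicting failure-inducing environment configurations.
# All data below is synthetic placeholder data, not the paper's dataset.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from xgboost import XGBClassifier

# Hypothetical dataset: each row encodes one environment configuration
# (e.g. goal position, positions of parked vehicles); label 1 marks
# configurations where the agent failed (collision or timeout).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 6))   # placeholder configuration features
y = (X[:, 0] * X[:, 1] > 0.2).astype(int)   # placeholder failure labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search over a small, assumed hyperparameter grid.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate with the classification metrics named above.
proba = search.predict_proba(X_test)[:, 1]
print("F1:", f1_score(y_test, search.predict(X_test)))
print("AUC-ROC:", roc_auc_score(y_test, proba))
```

In the GA-based failure search, the tuned model's predicted failure probability (`predict_proba`) could serve as a cheap fitness signal, ranking candidate configurations by predicted failure likelihood before any expensive simulation runs.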