Evaluating and Enhancing the Robustness of Proximal Policy Optimization to Test-Time Corruptions in Sequential Domains
M.R. Rodić (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.M. Celikok – Mentor (TU Delft - Sequential Decision Making)
F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)
Annibale Panichella – Graduation committee member (TU Delft - Software Engineering)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Reinforcement learning (RL) agents often achieve impressive results in simulation but can fail catastrophically when facing small deviations at deployment time. In this work, we examine the brittleness of Proximal Policy Optimization (PPO) agents under test-time observation noise and evaluate techniques for improving their robustness. We compare four variants, namely feed-forward PPO, Recurrent PPO (with LSTM memory), Noisy-PPO (trained with injected observation noise), and Recurrent-Noisy PPO, across two benchmarks: the classic CartPole-v1 and the more realistic Highway-env. Performance is measured over 100 episodes per corruption level, using mean return, success rate, and the Area Under the Degradation Curve (AUDC) as robustness metrics. Our results show that noise-augmented training yields the largest gains: Noisy-PPO maintains its clean-condition performance even at high noise levels, while recurrence alone offers a more modest improvement. In Highway-env, both noise injection and LSTM memory improve returns, indicating that even a simple integration of noise augmentation or recurrence can enhance PPO's robustness to real-world uncertainties.
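To make the noise-injection and AUDC ideas summarized above concrete, the Python sketch below shows a minimal Gaussian observation-noise wrapper for a Gymnasium environment and a trapezoidal AUDC computation. The Gaussian noise model, the sigma parameter, and the names GaussianObsNoise and audc are illustrative assumptions, not the exact implementation evaluated in the thesis.

import numpy as np
import gymnasium as gym


class GaussianObsNoise(gym.ObservationWrapper):
    """Injects zero-mean Gaussian noise into every observation.

    The abstract does not specify the corruption model; Gaussian noise with a
    configurable standard deviation (sigma) is an assumption made here.
    """

    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        # Corrupt only the observation returned to the agent; the underlying
        # environment state is left untouched.
        noisy = obs + np.random.normal(0.0, self.sigma, size=np.shape(obs))
        return noisy.astype(obs.dtype)


def audc(corruption_levels, mean_returns):
    """Area Under the Degradation Curve: mean return integrated over the
    evaluated corruption levels (trapezoidal rule); higher means more robust."""
    return np.trapz(mean_returns, corruption_levels)


# Usage sketch for Noisy-PPO-style training with injected observation noise
# (hypothetical hyperparameters, using the Stable-Baselines3 PPO implementation):
# from stable_baselines3 import PPO
# train_env = GaussianObsNoise(gym.make("CartPole-v1"), sigma=0.1)
# model = PPO("MlpPolicy", train_env).learn(total_timesteps=100_000)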