Generalization in Offline Reinforcement Learning: Comparing Implicit Q-Learning with Behavioral Cloning
J.J. Tarazona Rodríguez (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.T.J. Spaan – Mentor (TU Delft - Sequential Decision Making)
M.R. Weltevrede – Mentor (TU Delft - Sequential Decision Making)
E. Congeduti – Graduation committee member (TU Delft - Computer Science & Engineering-Teaching Team)
Abstract
Offline Reinforcement Learning (Offline RL) involves learning policies from a static dataset without further interaction with the environment, making it suitable for high-stakes scenarios where data collection is costly or risky. This paper investigates the generalization capabilities of Implicit Q-Learning (IQL), an offline RL algorithm, compared to Behavioral Cloning (BC). We adapt IQL for discrete control and evaluate both IQL and BC in a four-room environment using training datasets generated from different behavioral policies. Performance is assessed by average reward across multiple test seeds on reachable tasks, unreachable tasks, and the training set. Our results indicate that BC consistently outperforms IQL across all scenarios, although IQL reaches its peak performance faster. This study highlights the need for further research into offline RL algorithms to achieve better generalization and more robust performance in diverse environments. Full code is available on GitHub.
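To give a flavor of the mechanism that distinguishes IQL, the following is a minimal sketch (in NumPy, not the paper's actual implementation) of the expectile regression loss IQL uses to fit a value function toward an upper expectile of the Q-values, which lets it estimate the value of good in-dataset actions without querying actions outside the dataset. The function name and the choice `tau=0.7` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric (expectile) regression loss from IQL.

    Positive residuals (Q > V) are weighted by tau, negative ones by
    1 - tau; with tau > 0.5 this pushes V toward an upper expectile of
    the Q-distribution over dataset actions. Note: tau here is the
    expectile parameter, not a discount factor (illustrative naming).
    """
    diff = q_values - v_values
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

For example, a residual of +1 contributes 0.7 to the loss while a residual of -1 contributes only 0.3, so underestimating good actions is penalized more than overestimating poor ones.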