Multi-task Offline Reinforcement Learning with CQL
A study on how dataset size and diversity increase generalization performance
L. Lipinskas (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.T.J. Spaan – Mentor (TU Delft - Sequential Decision Making)
M.R. Weltevrede – Mentor (TU Delft - Sequential Decision Making)
E. Congeduti – Graduation committee member (TU Delft - Computer Science & Engineering-Teaching Team)
Abstract
Reinforcement learning (RL) is a type of machine learning in which an agent learns by observing its current state, selecting an action to execute, and receiving a reward for that action, after which it observes the next state and repeats the cycle until it reaches its goal. The traditional online training approach lets the agent interact directly with the live environment, but this is not always possible, because the live environment may be too dangerous or costly to train in. In such cases, offline training offers a viable alternative: the agent is trained on pre-collected datasets of such interactions and attempts to learn a better policy than the one used to collect them, for example with Q-learning methods such as Conservative Q-Learning (CQL). However, prior studies, such as Mediratta et al., have suggested that Behavior Cloning (BC), a form of imitation learning, may outperform modern offline RL methods in the multi-task setting, where a model's generalization is tested on new or similar tasks rather than the ones it was trained on. These results raise the question of whether it is worthwhile to employ modern Q-learning methods designed to derive a better policy than the one used to collect the data, especially when they are unable to outperform standard imitation learning.
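To make the contrast between the two methods concrete, below is a minimal PyTorch-style sketch of the two training objectives for a discrete-action setting. This is an illustration under assumptions, not the thesis's implementation: the network names, batch fields, and the penalty weight alpha are all hypothetical, and the conservative term shown is the standard log-sum-exp regularizer from the CQL literature.

```python
# Hedged sketch: CQL vs. BC objectives for discrete actions.
# All names (q_network, policy_network, alpha, batch tensors) are illustrative.
import torch
import torch.nn.functional as F

def cql_loss(q_network, states, actions, rewards, next_states, dones,
             gamma=0.99, alpha=1.0):
    # Standard TD loss on the logged transitions.
    q_values = q_network(states)                              # (B, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_network(next_states).max(1).values
    td_loss = F.mse_loss(q_taken, target)

    # Conservative regularizer: push down Q-values over all actions
    # (via log-sum-exp) while pushing up Q-values of actions in the dataset,
    # discouraging overestimation on out-of-distribution actions.
    conservative = (torch.logsumexp(q_values, dim=1) - q_taken).mean()
    return td_loss + alpha * conservative

def bc_loss(policy_network, states, actions):
    # Behavior Cloning simply imitates the dataset: a supervised
    # classification loss on the logged actions, with no reward signal.
    logits = policy_network(states)
    return F.cross_entropy(logits, actions)
```

The design difference the abstract hinges on is visible here: BC can at best recover the data-collection policy, while CQL uses rewards and bootstrapping to try to improve on it, at the cost of the extra conservatism term.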
This study seeks to reproduce and extend these findings within a custom environment.
The results reveal that, contrary to the aforementioned report, BC does not consistently outperform CQL: the two methods exhibit comparable performance across datasets varying in diversity and size. Additionally, incorporating more diverse data significantly enhances generalization performance.