Generalization in Offline Reinforcement Learning: Comparing Implicit Q-Learning with Behavioral Cloning
J.J. Tarazona Rodríguez (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.T.J. Spaan – Mentor (TU Delft - Sequential Decision Making)
M.R. Weltevrede – Mentor (TU Delft - Sequential Decision Making)
E. Congeduti – Graduation committee member (TU Delft - Computer Science & Engineering-Teaching Team)
Abstract
Offline Reinforcement Learning (Offline RL) involves learning policies from a static dataset without further interaction with the environment, making it suitable for high-stakes scenarios where data collection is costly or risky. This paper investigates the generalization capabilities of Implicit Q-Learning (IQL), an offline RL algorithm, compared to Behavioral Cloning (BC). We adapt IQL for discrete control and evaluate both IQL and BC in a four-room environment using training datasets generated from different behavioral policies. Performance is assessed by average reward across multiple test seeds on reachable tasks, unreachable tasks, and the training set. Our results indicate that BC consistently outperforms IQL across all scenarios, although IQL reaches its peak performance faster. This study highlights the need for further research into offline RL algorithms to achieve better generalization and more robust performance in diverse environments. Full code is available on GitHub.
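To give a flavor of the mechanism that distinguishes IQL, the following is a minimal sketch (in NumPy, not the paper's actual implementation) of the expectile regression loss IQL uses to fit a value function toward an upper expectile of the Q-values, which lets it estimate the value of good in-dataset actions without querying actions outside the dataset. The function name and the choice `tau=0.7` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric (expectile) regression loss from IQL.

    Positive residuals (Q > V) are weighted by tau, negative ones by
    1 - tau; with tau > 0.5 this pushes V toward an upper expectile of
    the Q-distribution over dataset actions. Note: tau here is the
    expectile parameter, not a discount factor (illustrative naming).
    """
    diff = q_values - v_values
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```

For example, a residual of +1 contributes 0.7 to the loss while a residual of -1 contributes only 0.3, so underestimating good actions is penalized more than overestimating poor ones.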