Impact of State Visitation Mismatch Methods on the Performance of On-Policy Reinforcement Learning

Bachelor Thesis (2024)
Author(s)

H. Cho (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)

S.R. Bongers – Mentor (TU Delft - Sequential Decision Making)

C.M. Jonker – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
27-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitation mismatch methods on the performance of on-policy RL methods, an area crucial for improving policy performance in real-world applications where behavior policies are often unknown. Specifically, we compare the convergence speed and performance of Q-learning initialized with Q-values learned through the Distribution Correction Estimation (DICE) method against Q-learning with traditional random initialization. We generate datasets representing behavior and target policies, use the DICE estimator to initialize the Q-values, and then run Q-learning in both the DICE-initialized and the randomly initialized scenarios. Our results demonstrate that initializing Q-learning with DICE Q-values speeds up convergence, so near-optimal policies are reached faster. This study provides insight into the effectiveness of state visitation mismatch methods for improving the efficiency and performance of on-policy RL algorithms, contributing to the development of more robust RL applications in behavior-agnostic settings.
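To make the comparison concrete, below is a minimal tabular sketch of the experiment the abstract describes: the same Q-learning loop run twice, once from random Q-values and once from a Q-table standing in for DICE-derived estimates. The five-state chain MDP, all hyperparameters, and the dice_q placeholder are illustrative assumptions, not the thesis's actual setup; in the study, the initial Q-values would come from running a DICE estimator on behavior-agnostic data, which is not reproduced here.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # toy chain MDP: action 0 = left, 1 = right
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.1
rng = np.random.default_rng(0)

def step(state, action):
    """Deterministic chain; reward 1 only on reaching the final state."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

def q_learning(q_init, episodes=200, max_steps=500):
    """Epsilon-greedy tabular Q-learning from a given initial Q-table."""
    q = q_init.astype(float).copy()
    returns = []
    for _ in range(episodes):
        state, total = 0, 0.0
        for _ in range(max_steps):
            if rng.random() < EPS:                 # explore
                action = int(rng.integers(N_ACTIONS))
            else:                                  # exploit
                action = int(np.argmax(q[state]))
            next_state, reward, done = step(state, action)
            target = reward + (0.0 if done else GAMMA * np.max(q[next_state]))
            q[state, action] += ALPHA * (target - q[state, action])
            state, total = next_state, total + reward
            if done:
                break
        returns.append(total)
    return q, returns

# Random initialization vs. a stand-in for DICE-derived Q-values.
# `dice_q` is a hypothetical placeholder: in the thesis these values
# would come from a DICE estimator fit on behavior-agnostic data.
random_q = rng.normal(size=(N_STATES, N_ACTIONS))
dice_q = np.tile([0.0, 1.0], (N_STATES, 1))   # crude values favoring "right"

_, ret_random = q_learning(random_q)
_, ret_dice = q_learning(dice_q)
print("mean return over first 20 episodes, random init:", np.mean(ret_random[:20]))
print("mean return over first 20 episodes, DICE-style init:", np.mean(ret_dice[:20]))
```

The intuition matches the abstract's finding: an initialization that already ranks actions sensibly reduces the exploration needed before rewards first propagate through the Q-table, so early-episode returns rise sooner than under random initialization.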

Files

Hongwoo_Final_Thesis.pdf
(pdf | 0.504 MB)
License info not available