Impact of State Visitation Mismatch Methods on the Performance of On-Policy Reinforcement Learning
H. Cho (TU Delft - Electrical Engineering, Mathematics and Computer Science)
FA Oliehoek – Mentor (TU Delft - Sequential Decision Making)
S.R. Bongers – Mentor (TU Delft - Sequential Decision Making)
C.M. Jonker – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
In the field of reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitation mismatch methods on the performance of on-policy RL, an area crucial for improving policy performance in real-world applications where behavior policies are often unknown. Specifically, we compare the convergence speed and performance of Q-learning when initialized with Q-values learned through the Distribution Correction Estimation (DICE) method against traditional random initialization. We generate datasets representing behavior and target policies, use the DICE estimator to initialize Q-values, and then run Q-learning under both DICE-based and random initialization. Our results demonstrate that initializing Q-learning with DICE Q-values improves convergence speed, leading to faster attainment of near-optimal policies. This study provides insights into the effectiveness of state visitation mismatch methods in improving the efficiency and performance of on-policy RL algorithms, contributing to the development of more robust RL applications in behavior-agnostic settings.
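To make the comparison concrete, the following is a minimal sketch of the experimental setup described above: tabular Q-learning started either from DICE-derived Q-values or from a random Q-table, with convergence tracked via episode returns. The environment interface (env.reset/env.step), the table shapes, and the q_dice array are illustrative assumptions; the DICE estimation step itself is not reproduced here.

```python
import numpy as np

def q_learning(env, q_init, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run epsilon-greedy tabular Q-learning from a given Q-table initialization."""
    q = q_init.copy()
    returns = []
    for _ in range(episodes):
        state = env.reset()
        done, total = False, 0.0
        while not done:
            # Epsilon-greedy action selection over the current Q-table.
            if np.random.rand() < epsilon:
                action = np.random.randint(q.shape[1])
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
            total += reward
        returns.append(total)
    return q, returns

# Hypothetical usage: q_dice would be produced by the DICE estimator fitted on
# the behavior-agnostic dataset; the baseline uses small random values.
# n_states, n_actions, and env are placeholders for a concrete tabular task.
# q_random = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
# _, returns_dice = q_learning(env, q_dice)
# _, returns_random = q_learning(env, q_random)
```

Comparing the two return curves (e.g., episodes needed to reach a near-optimal average return) is one way to quantify the convergence-speed difference the abstract reports.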