Impact of State Visitation Mismatch Methods on the Performance of On-Policy Reinforcement Learning
H. Cho (TU Delft - Electrical Engineering, Mathematics and Computer Science)
FA Oliehoek – Mentor (TU Delft - Sequential Decision Making)
S.R. Bongers – Mentor (TU Delft - Sequential Decision Making)
C.M. Jonker – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
In the field of reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitation mismatch methods on the performance of on-policy RL, an area crucial for improving policy performance in real-world applications where behavior policies are often unknown. Specifically, we compare the convergence speed and performance of Q-learning when initialized with Q-values learned through the Distribution Correction Estimation (DICE) method against traditional random initialization. We generate datasets representing behavior and target policies, use the DICE estimator to initialize Q-values, and then run Q-learning under both DICE-based and random initialization. Our results demonstrate that initializing Q-learning with DICE Q-values improves convergence speed, leading to faster attainment of near-optimal policies. This study provides insights into the effectiveness of state visitation mismatch methods in improving the efficiency and performance of on-policy RL algorithms, contributing to the development of more robust RL applications in behavior-agnostic settings.
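To make the comparison concrete, the following is a minimal sketch of the experimental setup described above: tabular Q-learning started either from DICE-derived Q-values or from a random Q-table, with convergence tracked via episode returns. The environment interface (env.reset/env.step), the table shapes, and the q_dice array are illustrative assumptions; the DICE estimation step itself is not reproduced here.

```python
import numpy as np

def q_learning(env, q_init, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run epsilon-greedy tabular Q-learning from a given Q-table initialization."""
    q = q_init.copy()
    returns = []
    for _ in range(episodes):
        state = env.reset()
        done, total = False, 0.0
        while not done:
            # Epsilon-greedy action selection over the current Q-table.
            if np.random.rand() < epsilon:
                action = np.random.randint(q.shape[1])
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # Standard Q-learning update toward the bootstrapped target.
            target = reward + (0.0 if done else gamma * np.max(q[next_state]))
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
            total += reward
        returns.append(total)
    return q, returns

# Hypothetical usage: q_dice would be produced by the DICE estimator fitted on
# the behavior-agnostic dataset; the baseline uses small random values.
# n_states, n_actions, and env are placeholders for a concrete tabular task.
# q_random = np.random.uniform(-0.01, 0.01, size=(n_states, n_actions))
# _, returns_dice = q_learning(env, q_dice)
# _, returns_random = q_learning(env, q_random)
```

Comparing the two return curves (e.g., episodes needed to reach a near-optimal average return) is one way to quantify the convergence-speed difference the abstract reports.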