The Impact of Initial Start Distribution Mismatch on Policy Evaluation in Behavior-agnostic Reinforcement Learning

Abstract

Behavior-agnostic reinforcement learning is a rapidly expanding research area focused on learning effective policies without explicit knowledge of the environment's dynamics or of the behavior policies that generated the data. The field has produced robust techniques for off-policy evaluation, notably Distribution Correction Estimation (DICE) methods, in the context of infinite-horizon Markov Decision Processes (MDPs). This paper investigates the impact of initial start distribution mismatch on the accuracy of DICE estimators in behavior-agnostic reinforcement learning. To this end, seven systematically constructed initial start distributions were created, and the mismatch between them was quantified via Kullback–Leibler (KL) divergence. Off-policy evaluation performance was then assessed with DICE estimators by comparing the Mean Squared Error (MSE) of their estimates against ground-truth values. The experiments reveal no clear influence of the initial start distribution mismatch on the performance of the DICE estimators. Future work should therefore broaden the experimental scope and address the limitations of this study to assess this impact more accurately. The paper underscores the complexity of the initial start distribution choice in behavior-agnostic reinforcement learning, calling for further research to evaluate its effect across diverse environments and measures. Exploring the relation between the initial start distribution and the evaluated policies could also provide deeper insight into their joint influence on DICE estimators.
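
As a minimal illustration of the two quantities named above, the Python sketch below computes the KL-divergence mismatch between two discrete initial start distributions and the MSE of a set of off-policy value estimates against a ground-truth policy value. It is not the paper's implementation: all names and numbers are hypothetical, and it assumes a finite state space with distributions represented as NumPy arrays.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) between two discrete initial-state distributions
    # defined over the same finite state space. A small eps avoids
    # log(0); this assumes the two distributions share support.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mse(estimates, ground_truth):
    # Mean squared error of off-policy value estimates against the
    # ground-truth policy value.
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - ground_truth) ** 2))

# Hypothetical example over a 4-state space: a uniform start
# distribution used for evaluation versus the environment's true one.
d0_eval = np.array([0.25, 0.25, 0.25, 0.25])
d0_true = np.array([0.70, 0.10, 0.10, 0.10])

mismatch = kl_divergence(d0_eval, d0_true)
error = mse([0.48, 0.52, 0.50], ground_truth=0.55)
print(f"KL mismatch: {mismatch:.4f}, MSE: {error:.4f}")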
