The Impact of Initial Start Distribution Mismatch on Policy Evaluation in Behavior-agnostic Reinforcement Learning

Abstract

Behavior-agnostic reinforcement learning is a rapidly expanding research area focused on learning effective policies without explicit knowledge of the environment's dynamics or of the behavior policies that generated the data. The field has produced robust techniques for off-policy evaluation, notably Distribution Correction Estimation (DICE) methods, in the context of infinite-horizon Markov Decision Processes (MDPs). This paper investigates the impact of initial start distribution mismatch on the accuracy of DICE estimators in behavior-agnostic reinforcement learning. To this end, seven systematically constructed initial start distributions were created, and the mismatch between them was quantified via Kullback–Leibler (KL) divergence. Off-policy evaluation performance was then assessed with DICE estimators by comparing the Mean Squared Error (MSE) of their estimates against ground-truth values. The experiments reveal no clear influence of the initial start distribution mismatch on the performance of the DICE estimators. Future work should therefore broaden the experimental scope and address the limitations of this study to assess this impact more accurately. The paper underscores the complexity of the initial start distribution choice in behavior-agnostic reinforcement learning, calling for further research to evaluate its effect across diverse environments and measures. Exploring the relation between the initial start distribution and the evaluated policies could also provide deeper insight into their joint influence on DICE estimators.
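
As a minimal illustration of the two quantities named above, the Python sketch below computes the KL-divergence mismatch between two discrete initial start distributions and the MSE of a set of off-policy value estimates against a ground-truth policy value. It is not the paper's implementation: all names and numbers are hypothetical, and it assumes a finite state space with distributions represented as NumPy arrays.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) between two discrete initial-state distributions
    # defined over the same finite state space. A small eps avoids
    # log(0); this assumes the two distributions share support.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def mse(estimates, ground_truth):
    # Mean squared error of off-policy value estimates against the
    # ground-truth policy value.
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - ground_truth) ** 2))

# Hypothetical example over a 4-state space: a uniform start
# distribution used for evaluation versus the environment's true one.
d0_eval = np.array([0.25, 0.25, 0.25, 0.25])
d0_true = np.array([0.70, 0.10, 0.10, 0.10])

mismatch = kl_divergence(d0_eval, d0_true)
error = mse([0.48, 0.52, 0.50], ground_truth=0.55)
print(f"KL mismatch: {mismatch:.4f}, MSE: {error:.4f}")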
