The Effect of State-visitation Mismatch on Off-policy Performance in Behaviour-agnostic Reinforcement Learning

Abstract

Off-policy evaluation suffers from several key problems, one of them being the "curse of horizon". With recent breakthroughs [1] [2], new estimators have emerged that apply importance weighting to individual state-action pairs and rewards rather than to entire trajectories. Because the behaviour and target policies differ, a state-visitation mismatch arises. This paper investigates how the degree of state-visitation mismatch affects the overall target policy performance. The approach is to measure the state-visitation mismatch with the KL divergence, computed from the state-visitation distribution of the behaviour policy and the distribution correction ratio produced by the DICE estimator. In this way, the state-visitation mismatch can be quantified. Furthermore, the effect on the target policy performance is quantified by the MSE between the empirically estimated cumulative reward and the reward estimated by the DICE estimator. By analysing the KL divergence and MSE values, one may argue that the state-visitation mismatch does impact the performance of the target policy, although further research needs to be conducted.
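For concreteness, the two quantities can be sketched as follows, assuming standard DICE notation (the symbols d^pi, d^D, w, and rho-hat are my labelling for illustration, not necessarily the paper's): with d^pi the state-visitation distribution of the target policy, d^D that of the behaviour policy, and w(s,a) the correction ratio learned by a DICE estimator, the mismatch and the estimation error can be written as

\[ D_{\mathrm{KL}}\left(d^{\pi} \,\|\, d^{D}\right) = \mathbb{E}_{(s,a)\sim d^{D}}\left[\, w(s,a)\,\log w(s,a) \,\right], \qquad w(s,a) \approx \frac{d^{\pi}(s,a)}{d^{D}(s,a)} \]

\[ \mathrm{MSE} = \left(\hat{\rho}_{\mathrm{DICE}} - \hat{\rho}_{\mathrm{MC}}\right)^{2}, \qquad \hat{\rho}_{\mathrm{DICE}} = \mathbb{E}_{(s,a)\sim d^{D}}\left[\, w(s,a)\, r(s,a) \,\right] \]

where rho-hat_MC denotes the empirical (Monte Carlo) estimate of the target policy's cumulative reward. The first identity rewrites the KL divergence as an expectation over the behaviour distribution weighted by the correction ratio, which matches the ingredients the abstract names: samples from the behaviour policy and the DICE ratio.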