SimuDICE: Offline Policy Optimization Through Iterative World Model Updates and DICE Estimation
C. Brita (TU Delft - Electrical Engineering, Mathematics and Computer Science)
F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)
S.R. Bongers – Mentor (TU Delft - Sequential Decision Making)
C.M. Jonker – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatch between the state-action distribution of the target policy and that of the behavioral policy that generated the data. Learning a dynamics model of the environment can improve sample efficiency, but this distribution mismatch can cause the model to generate suboptimal synthetic experiences. We propose SimuDICE, an algorithm that enhances the sampling of imaginary experiences using Dual stationary DIstribution Correction Estimation (DICE) and iteratively improves the DICE estimates with the synthetically generated experiences. SimuDICE addresses the objective mismatch issue by iteratively updating both the world model and the DICE estimator, aligning the model's training objective (imitating the environment) with its usage objective (policy improvement). We show that SimuDICE requires less pre-collected data and fewer simulated experiences to achieve results comparable to other algorithms, while being more robust to lower-quality data.
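To make the alternation described above concrete, the sketch below outlines one possible shape of the iterative loop. It is an illustration, not the authors' implementation: the component names (world_model, dice, policy) and their methods are hypothetical placeholders, and the actual update rules and sampling scheme are given in the thesis itself.

def simudice_loop(offline_data, world_model, dice, policy,
                  n_iterations=10, n_synthetic=1000):
    """Hypothetical sketch: alternate between fitting the world model,
    estimating DICE correction ratios, and improving the policy on
    DICE-weighted synthetic experiences."""
    for _ in range(n_iterations):
        # Fit the dynamics model to the pre-collected experiences.
        world_model.fit(offline_data)

        # Estimate stationary-distribution correction ratios between the
        # target policy's state-action distribution and the behavioral one.
        weights = dice.estimate(offline_data, policy)

        # Sample imaginary experiences from the model, biased toward
        # state-action pairs the correction ratios mark as under-represented
        # for the target policy.
        synthetic = world_model.sample(policy, weights, n_synthetic)

        # Improve the policy on the synthetic experiences, then refine the
        # DICE estimates with them so that the model's training objective
        # stays aligned with how the policy actually uses it.
        policy.update(synthetic)
        dice.refine(synthetic)

    return policy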