SimuDICE: Offline Policy Optimization Through Iterative World Model Updates and DICE Estimation
C. Brita (TU Delft - Electrical Engineering, Mathematics and Computer Science)
F.A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)
S.R. Bongers – Mentor (TU Delft - Sequential Decision Making)
C.M. Jonker – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatch between the state-action distribution of the target policy and that of the behavioral policy that generated the data. Learning a dynamics model of the environment can improve sample efficiency, but this distribution mismatch can cause the model to generate suboptimal synthetic experiences. We propose SimuDICE, an algorithm that enhances the sampling of imaginary experiences using Dual stationary DIstribution Correction Estimation (DICE) and iteratively improves the DICE estimates with the synthetically generated experiences. SimuDICE addresses the objective mismatch issue by iteratively updating both the world model and the DICE estimator, aligning the model's training objective (imitating the environment) with its usage objective (policy improvement). We show that SimuDICE requires less pre-collected data and fewer simulated experiences to achieve results comparable to other algorithms, while being more robust to lower-quality data.
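To make the alternation described above concrete, the sketch below outlines one possible shape of the iterative loop. It is an illustration, not the authors' implementation: the component names (world_model, dice, policy) and their methods are hypothetical placeholders, and the actual update rules and sampling scheme are given in the thesis itself.

def simudice_loop(offline_data, world_model, dice, policy,
                  n_iterations=10, n_synthetic=1000):
    """Hypothetical sketch: alternate between fitting the world model,
    estimating DICE correction ratios, and improving the policy on
    DICE-weighted synthetic experiences."""
    for _ in range(n_iterations):
        # Fit the dynamics model to the pre-collected experiences.
        world_model.fit(offline_data)

        # Estimate stationary-distribution correction ratios between the
        # target policy's state-action distribution and the behavioral one.
        weights = dice.estimate(offline_data, policy)

        # Sample imaginary experiences from the model, biased toward
        # state-action pairs the correction ratios mark as under-represented
        # for the target policy.
        synthetic = world_model.sample(policy, weights, n_synthetic)

        # Improve the policy on the synthetic experiences, then refine the
        # DICE estimates with them so that the model's training objective
        # stays aligned with how the policy actually uses it.
        policy.update(synthetic)
        dice.refine(synthetic)

    return policy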