SimuDICE: Offline Policy Optimization Through Iterative World Model Updates and DICE Estimation

Abstract

In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatch between the state-action distribution of the target policy and that of the behavioral policy which generated the data. Learning a dynamics model of the environment can improve the sample efficiency of the algorithm, but this distribution mismatch can lead to the generation of suboptimal experiences. We propose SimuDICE, an algorithm that enhances the sampling of imagined experiences using Dual stationary DIstribution Correction Estimation (DICE) and iteratively improves the DICE estimates with synthetically generated experiences. SimuDICE addresses the objective-mismatch issue by iteratively updating both the world model and the DICE estimator, aligning the model's training objective (imitating the environment) with its usage objective (policy improvement). We show that SimuDICE requires less pre-collected data and fewer simulated experiences to achieve results comparable to other algorithms, while being more robust to lower-quality data.
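
The loop described above (alternating between world-model updates, DICE-guided imagination, and policy improvement) can be illustrated with a minimal structural sketch. The sketch below is based only on this abstract; all class and function names (WorldModel, DICEEstimator, Policy, imagine, ratios_for, etc.) are hypothetical placeholders with trivial stand-in implementations, not the authors' actual algorithm or code.

```python
# Structural sketch of a SimuDICE-style iterative loop (hypothetical names throughout).
import numpy as np

rng = np.random.default_rng(0)


class WorldModel:
    """Learned dynamics model; here a dummy that replays stored offline transitions."""

    def fit(self, transitions):
        self.transitions = transitions  # (state, action, reward, next_state) tuples

    def imagine(self, n, weights):
        # Sample starting points for imagined experiences, biased by DICE ratios.
        p = weights / weights.sum()
        idx = rng.choice(len(self.transitions), size=n, p=p)
        return [self.transitions[i] for i in idx]


class DICEEstimator:
    """Estimates stationary distribution correction ratios d^pi / d^D."""

    def fit(self, transitions, policy):
        pass  # a real estimator would solve the DICE optimization here

    def ratios_for(self, transitions):
        # Dummy ratios; stand-in for the learned correction weights.
        return rng.uniform(0.5, 1.5, size=len(transitions))


class Policy:
    def improve(self, experiences):
        pass  # placeholder for any offline policy-improvement step


# Toy offline dataset standing in for the pre-collected experiences.
offline_data = [
    (rng.normal(size=2), int(rng.integers(2)), float(rng.normal()), rng.normal(size=2))
    for _ in range(256)
]

policy, model, dice = Policy(), WorldModel(), DICEEstimator()
weights = np.ones(len(offline_data))

for iteration in range(5):
    # 1. (Re)train the world model on the offline data.
    model.fit(offline_data)
    # 2. Generate imagined experiences, with sampling guided by the DICE ratios,
    #    and improve the policy using them.
    imagined = model.imagine(n=128, weights=weights)
    policy.improve(imagined)
    # 3. Refine the DICE estimator using both real and synthetic experiences,
    #    then reuse the updated ratios to guide the next round of imagination.
    dice.fit(offline_data + imagined, policy)
    weights = dice.ratios_for(offline_data)
```

The point of the alternation is that each component corrects the other: the DICE ratios steer imagination toward state-action pairs the target policy would actually visit, and the synthetic experiences in turn sharpen the DICE estimates beyond what the limited offline data alone supports.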

Files