Planning agents have demonstrated superhuman performance in deterministic environments such as chess and Go by combining end-to-end reinforcement learning with powerful tree-based search algorithms. To extend such agents to stochastic or partially observable domains, Stochastic MuZero introduced a framework that models environment uncertainty by splitting each transition into an agent action and a learned stochastic outcome. In this paper, we propose a novel architecture, FlowZero, which builds on this idea but replaces the discrete latent modeling of environment stochasticity with Conditional Normalizing Flows (CNFs). This allows the model to learn a rich, continuous probability distribution over possible future states conditioned on the afterstate. The key advantage of this approach is exact log-likelihood evaluation, which offers more precise density estimation than the evidence lower bound (ELBO) used in Stochastic MuZero. We evaluate the proposed CNF’s capacity both to overfit training data and to generalize to similar and larger data, as well as FlowZero’s ability to perform in a stochastic environment.
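To make the exact-likelihood claim concrete, the following is a minimal sketch of the change-of-variables computation in a conditional normalizing flow, using a single affine (location-scale) transform whose parameters are produced from a context vector standing in for the afterstate. The conditioner weights (`W_mu`, `W_ls`) and dimensions are illustrative assumptions, not the architecture described in the paper; a full CNF would stack many such invertible transforms.

```python
# Hedged sketch: exact log p(x | c) via the change-of-variables formula
# for a single conditional affine flow. All names and weights here are
# hypothetical placeholders, not FlowZero's actual parameterization.
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 3                           # state dim, context (afterstate) dim
W_mu = rng.normal(size=(C, D))        # conditioner producing the shift
W_ls = rng.normal(size=(C, D)) * 0.1  # conditioner producing the log-scale

def log_likelihood(x, c):
    """Exact log p(x | c): base log-density plus log |det dz/dx|."""
    mu = c @ W_mu                     # conditional shift mu(c)
    log_s = c @ W_ls                  # conditional log-scale log sigma(c)
    z = (x - mu) * np.exp(-log_s)     # invert the affine transform
    log_base = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=-1)  # N(0, I) prior
    log_det = -np.sum(log_s, axis=-1)  # Jacobian term of the inverse map
    return log_base + log_det          # exact density, no ELBO gap

x = rng.normal(size=(2, D))           # two candidate next states
c = rng.normal(size=(2, C))           # their conditioning afterstates
print(log_likelihood(x, c))           # one exact log-density per sample
```

Because the transform is invertible with a tractable Jacobian, the density is evaluated exactly rather than bounded from below, which is the contrast with ELBO-based discrete latent models drawn in the abstract.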