P.R. van der Vaart | TU Delft Repository

Bayesian Model-Free Deep Reinforcement Learning

Doctoral thesis (2026) - P.R. van der Vaart, M.T.J. Spaan, N. Yorke-Smith

The goal of reinforcement learning is to train agents to perform tasks under little supervision. Tasks are specified by a reward function and transition function, which state how much reward the agent gets for its action in a state, and how the environment state changes based on the action the agent took. Typically reinforcement learning assumes no prior knowledge over the reward and transition function,meaning that agents need to explore the environment and learn essentially through trial and error. Model-free methods attempt to learn which actions lead to good outcomes without modeling the reward or environments itself. Efficiently selecting that actions are promising is an active research direction which can greatly reduce the number of total interactions needed for an agent to learn the task, potentially opening the door to new applications where trials or simulations are expensive or compute is limited.
Uncertainty quantification is a central mechanism in such efficient exploration methods. Provided with an estimate of how certain the agent is about the outcome of an action, it can intelligently weigh whether it is worth exploring. The Bayesian paradigm is one method to quantify uncertainty in machine learning. It models the uncertainty with a probability distributions over models, specifying how likely a model is based on the data the agent has collected.
We adopt a Bayesian point of view in model-free reinforcement learning, and develop a deeper understanding on when Bayesian reinforcement learning methods can be expected to work well and challenges that remain. To this end, in Chapter 2 we propose training ensembles through Sequential Monte Carlo, obtaining a sample from the posterior distribution of a deep Q-learning agent. We observe that agents are able to perform directed exploration, although not necessarily more efficiently than standard ensembles in every environment. Furthermore, in Chapter 3 we theoretically analyze existing Bayesian Deep model-Free Reinforcement Learning methods, and unify them into a single theoretical framework we call Epistemic Bellman Operators. We prove that these operators are contractions, establishing convergence of derived algorithms in a simplified setting. Finally, in Chapter 4 we analyze the likelihood and prior assumptions
in existing Bayesian deep model-free reinforcement learning methods, and find through statistical tests that the standard likelihood assumptions are violated on every benchmark we tested. We also find that we can improve performance of Bayesian model free reinforcement learning methods by picking different priors based on empirical data from unrelated tasks, which transfer to new environments.
This dissertation establishes several desirable properties of Bayesian Deep model free reinforcement learning, but also raises some key issues, most notably misspecification in Chapter 4. We hope our findings convince other Bayesian reinforcement learning researchers to give more attention to assumptions about priors and likelihoods. ...

The goal of reinforcement learning is to train agents to perform tasks under little supervision. Tasks are specified by a reward function and transition function, which state how much reward the agent gets for its action in a state, and how the environment state changes based on the action the agent took. Typically reinforcement learning assumes no prior knowledge over the reward and transition function,meaning that agents need to explore the environment and learn essentially through trial and error. Model-free methods attempt to learn which actions lead to good outcomes without modeling the reward or environments itself. Efficiently selecting that actions are promising is an active research direction which can greatly reduce the number of total interactions needed for an agent to learn the task, potentially opening the door to new applications where trials or simulations are expensive or compute is limited.
Uncertainty quantification is a central mechanism in such efficient exploration methods. Provided with an estimate of how certain the agent is about the outcome of an action, it can intelligently weigh whether it is worth exploring. The Bayesian paradigm is one method to quantify uncertainty in machine learning. It models the uncertainty with a probability distributions over models, specifying how likely a model is based on the data the agent has collected.
We adopt a Bayesian point of view in model-free reinforcement learning, and develop a deeper understanding on when Bayesian reinforcement learning methods can be expected to work well and challenges that remain. To this end, in Chapter 2 we propose training ensembles through Sequential Monte Carlo, obtaining a sample from the posterior distribution of a deep Q-learning agent. We observe that agents are able to perform directed exploration, although not necessarily more efficiently than standard ensembles in every environment. Furthermore, in Chapter 3 we theoretically analyze existing Bayesian Deep model-Free Reinforcement Learning methods, and unify them into a single theoretical framework we call Epistemic Bellman Operators. We prove that these operators are contractions, establishing convergence of derived algorithms in a simplified setting. Finally, in Chapter 4 we analyze the likelihood and prior assumptions
in existing Bayesian deep model-free reinforcement learning methods, and find through statistical tests that the standard likelihood assumptions are violated on every benchmark we tested. We also find that we can improve performance of Bayesian model free reinforcement learning methods by picking different priors based on empirical data from unrelated tasks, which transfer to new environments.
This dissertation establishes several desirable properties of Bayesian Deep model free reinforcement learning, but also raises some key issues, most notably misspecification in Chapter 4. We hope our findings convince other Bayesian reinforcement learning researchers to give more attention to assumptions about priors and likelihoods.

Epistemic Bellman Operators

Journal article (2025) - Pascal R. van der Vaart, Matthijs T.J. Spaan, Neil Yorke-Smith

Uncertainty quantification remains a difficult challenge in reinforcement learning. Several algorithms exist that successfully quantify uncertainty in a practical setting. However it is unclear whether these algorithms are theoretically sound and can be expected to converge. Furthermore, they seem to treat the uncertainty in the target parameters in different ways. In this work, we unify several practical algorithms into one theoretical framework by defining a new Bellman operator on distributions, and show that this Bellman operator is a contraction. We highlight use cases of our framework by analyzing an existing Bayesian Q-learning algorithm, and also introduce a novel uncertainty-aware variant of PPO that adaptively sets its clipping hyperparameter. ...

Value Improved Actor Critic Algorithms

Preprint (2024) - Y. Oren, M.A. Zanger, P.R. van der Vaart, M.T.J. Spaan, J.W. Böhmer

Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update, to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based in the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG in the Mujoco benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested ...

Bayesian Ensembles for Exploration in Deep Q-Learning

Conference paper (2024) - P.R. van der Vaart, N. Yorke-Smith, M.T.J. Spaan

Exploration in reinforcement learning remains a difficult challenge. In order to drive exploration, ensembles with randomized prior functions have recently been popularized to quantify uncertainty in the value model. There is no theoretical reason for these ensembles to resemble the actual posterior, however. In this work, we view training ensembles from the perspective of Sequential Monte Carlo, a Monte Carlo method that approximates a sequence of distributions with a set of particles. In particular, we propose an algorithm that exploits both the practical flexibility of ensembles and theory of the Bayesian paradigm. We incorporate this method into a standard Deep Q-learning agent (DQN) and experimentally show qualitatively good uncertainty quantification and improved exploration capabilities over a regular ensemble. ...

Bayesian Model-Free Deep Reinforcement Learning

Conference paper (2024) - P.R. van der Vaart

Exploration in reinforcement learning remains a difficult challenge. In order to drive exploration, ensembles with randomized prior functions have recently been popularized to quantify uncertainty in the value model. However these ensembles have no theoretical reason to resemble the actual Bayesian posterior, which is known to provide strong performance in theory under certain conditions. In this thesis work, we view training ensembles from the perspective of Sequential Monte Carlo, a Monte Carlo method that approximates a sequence of distributions with a set of particles, and propose an algorithm that exploits both the practical flexibility of ensembles and theory of the Bayesian paradigm. We incorporate this method into a standard DQN agent and experimentally show qualitatively good uncertainty quantification and improved exploration capabilities over a regular ensemble. In the future, we will investigate the impact of likelihood and prior choices in Bayesian model-free reinforcement learning methods. ...