M.T.J. Spaan | TU Delft Repository

Epistemic Bellman Operators

Journal article (2025) - P.R. van der Vaart (author) , Matthijs T. J. Spaan (author) , N. Yorke-Smith (author)

Uncertainty quantification remains a difficult challenge in reinforcement learning. Several algorithms exist that successfully quantify uncertainty in a practical setting. However, it is unclear whether these algorithms are theoretically sound and can be expected to converge. Furt ...

Epistemic Monte Carlo Tree Search

Conference paper (2025) - Y. Oren (author) , Viliam Vadocz (author) , Matthijs T.J. Spaan (author) , Wendelin Böhmer (author)

The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and ...

Bayesian Ensembles for Exploration in Deep Q-Learning

Conference paper (2024) - P.R. van der Vaart (author) , N. Yorke-Smith (author) , Matthijs T. J. Spaan (author)

Exploration in reinforcement learning remains a difficult challenge. In order to drive exploration, ensembles with randomized prior functions have recently been popularized to quantify uncertainty in the value model. There is no theoretical reason for these ensembles to resemble ...

Value Improved Actor Critic Algorithms

Preprint (2024) - Yaniv Oren (author) , M.A. Zanger (author) , P.R. van der Vaart (author) , MTJ Spaan (author) , J.W. Böhmer (author)

Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm ...

Scalable Safe Policy Improvement for Factored Multi-Agent MDPs

Conference paper (2024) - Federico Bianchi (author) , Edoardo Zorzi (author) , Alberto Castellini (author) , Thiago D. Simão (author) , M.T.J. Spaan (author) , Alessandro Farinelli (author)

In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseli ...

Diverse Projection Ensembles for Distributional Reinforcement Learning

Conference paper (2024) - M.A. Zanger (author) , Wendelin Böhmer (author) , Matthijs T. J. Spaan (author)

In contrast to classical reinforcement learning, distributional RL algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common approach finds appro ...

Bayesian Meta-Reinforcement Learning with Laplace Variational Recurrent Networks

Preprint (2024) - J.A. de Vries (author) , Jinke He (author) , Mathijs M. De Weerdt (author) , Matthijs Spaan (author)

Reinforcement Learning by Guided Safe Exploration

Conference paper (2023) - Qisong Yang (author) , Thiago D. Simão (author) , Nils Jansen (author) , Simon Tindemans (author) , Matthijs T. J. Spaan (author)

Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward- ...

Scalable Safe Policy Improvement via Monte Carlo Tree Search

Journal article (2023) - Alberto Castellini (author) , Federico Bianchi (author) , Edoardo Zorzi (author) , Thiago D. Simão (author) , Alessandro Farinelli (author) , M.T.J. Spaan (author)

Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We th ...

E-MCTS: Deep Exploration in Model-Based Reinforcement Learning by Planning with Epistemic Uncertainty

Preprint (2023) - Y. Oren (author) , Matthijs T. J. Spaan (author) , Wendelin Böhmer (author)

One of the most well-studied and highly performing planning approaches used in Model-Based Reinforcement Learning (MBRL) is Monte-Carlo Tree Search (MCTS). Key challenges of MCTS-based MBRL methods remain dedicated deep exploration and reliability in the face of the unknown, and ...

Diverse Projection Ensembles for Distributional Reinforcement Learning

Conference paper (2023) - M.A. Zanger (author) , Wendelin Böhmer (author) , Matthijs T. J. Spaan (author)

In contrast to classical reinforcement learning, distributional reinforcement learning algorithms aim to learn the distribution of returns rather than their expected value. Since the nature of the return distribution is generally unknown a priori or arbitrarily complex, a common ...

Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Preprint (2023) - M. Suau de Castro (author) , MTJ Spaan (author) , F.A. Oliehoek (author)

Reinforcement learning agents may sometimes develop habits that are effective only when specific policies are followed. After an initial exploration phase in which agents try out different actions, they eventually converge toward a particular policy. When this occurs, the distrib ...

CEM: Constrained Entropy Maximization for Task-Agnostic Safe Exploration

Conference paper (2023) - Qisong Yang (author) , Matthijs T. J. Spaan (author)

Without an assigned task, a suitable intrinsic objective for an agent is to explore the environment efficiently. However, the pursuit of exploration will inevitably bring more safety risks. An under-explored aspect of reinforcement learning is how to achieve safe efficient ex ...

The Role of Diverse Replay for Generalisation in Reinforcement Learning

Preprint (2023) - M.R. Weltevrede (author) , Matthijs T. J. Spaan (author) , J.W. Böhmer (author)

In reinforcement learning (RL), key components of many algorithms are the exploration strategy and replay buffer. These strategies regulate what environment data is collected and trained on and have been extensively studied in the RL literature. In this paper, we investigate the ...

Refined Risk Management in Safe Reinforcement Learning with a Distributional Safety Critic

Conference paper (2022) - Q. Yang (author) , Thiago D. Simão (author) , Simon H. Tindemans (author) , Matthijs TJ Spaan (author)

Safety is critical to broadening the real-world use of reinforcement learning (RL). Modeling the safety aspects using a safety-cost signal separate from the reward is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. ...

Abstraction-Refinement for Hierarchical Probabilistic Models

Conference paper (2022) - Sebastian Junges (author) , M.T.J. Spaan (author)

Markov decision processes are a ubiquitous formalism for modelling systems with non-deterministic and probabilistic behavior. Verification of these models is subject to the famous state space explosion problem. We alleviate this problem by exploiting a hierarchical structure with ...

An Auction-Based Multi-Agent System for the Pickup and Delivery Problem with Autonomous Vehicles and Alternative Locations

Conference paper (2022) - J. Los (author) , F. Schulte (author) , Matthijs Spaan (author) , RR Negenborn (author)

The trends of autonomous transportation and mobility on demand in line with large numbers of requests increasingly call for decentralized vehicle routing optimization. Multi-agent systems (MASs) allow to model fully autonomous decentralized decision making, but are rarely conside ...

Training and Transferring Safe Policies in Reinforcement Learning

Conference paper (2022) - Qisong Yang (author) , Thiago D. Simão (author) , Nils Jansen (author) , Simon Tindemans (author) , Matthijs T. J. Spaan (author)

Safety is critical to broadening the application of reinforcement learning (RL). Often, RL agents are trained in a controlled environment, such as a laboratory, before being deployed in the real world. However, the target reward might be unknown prior to deployment. Reward-free R ...

Strategic Bidding in Decentralized Collaborative Vehicle Routing

Conference paper (2022) - J. Los (author) , F. Schulte (author) , Matthijs Spaan (author) , RR Negenborn (author)

Collaboration in transportation is important to reduce costs and emissions, but carriers may have incentives to bid strategically in decentralized auction systems. We investigate what the effect of the auction strategy is on the possible cheating benefits in a dynamic context, su ...

Large-scale collaborative vehicle routing

Journal article (2022) - J. Los (author) , F. Schulte (author) , Margaretha Gansterer (author) , Richard F. Hartl (author) , Matthijs T. J. Spaan (author) , R. R. Negenborn (author)

Carriers can remarkably reduce transportation costs and emissions when they collaborate, for example through a platform. Such gains, however, have only been investigated for relatively small problem instances with low numbers of carriers. We develop auction-based methods for larg ...