F.A. Oliehoek | TU Delft Repository

A timely match for ride-hailing and ride-pooling services using a deep reinforcement learning approach

Journal article (2026) - Yiman Bao, Jie Gao, Jinke He, Frans A. Oliehoek, Oded Cats

Efficient matching in ride-hailing and ride-pooling services depends not only on how matches are constructed, but also on when the platform triggers a matching operation. Many systems use batched matching with a fixed time interval to accumulate requests before matching, which increases the candidate set but cannot adapt to real time supply-demand fluctuations and may induce unnecessary waiting. This paper proposes a reinforcement learning approach that learns when to trigger matching based on current system conditions. We formulate the timing problem as a finite-horizon Markov decision process and train the policy using the Proximal Policy Optimization algorithm. To address sparse and delayed feedback, we introduce a finite-horizon, potential-based reward shaping scheme that preserves the optimal policy while densifying the learning signal; the same framework applies to both ride-hailing and ride-pooling, where detour delay is incorporated into the reward for pooling. Using a data-driven simulator calibrated on NYC trip records, the learned policy adapts matching timing decisions to the current state of waiting requests and available drivers and outperforms fixed-interval, rule-based dynamic, and first-dispatch baselines. It reduces total waiting time by 3.1% in ride-hailing and 20.1% in ride-pooling, and detour delay by 36.1% in pooling, while maintaining short matching times. ...

Preface

Foreword postscript (2025) - Frans A. Oliehoek, Manon Kok, Sicco Verwer

In this volume, we are happy present the post-proceedings of BNAIC/BeNeLearn 2023, the joint conference on Artificial Intelligence and Machine Learning in the BeNeLux, which took place at TU Delft. It is the main regional conference on these topics and has a long tradition: in 2018, the 30th Benelux Conference on Artificial Intelligence (BNAIC) and the 27th Belgian Dutch Conference on Machine Learning (Benelearn) were jointly organized in ‘s Hertogenbosch, and this has been repeated annually since. [...] ...

Towards an Experimentation Platform for Hybrid Human-AI Sequential Decision-Making

Journal article (2025) - Hüseyin Aydin, Kevin Godin-Dubois, Libio Goncalvez Braz, Floris Den Hengst, Kim Baraka, Mustafa Mert Çelikok, Andreas Sauter, Shihan Wang, Frans A. Oliehoek

We present SHARPIE (Shared Human-AI Reinforcement Learning Platform for Interactive Experiments), a generic framework to support experiments with RL agents and humans. It consists of a versatile wrapper for RL environments and algorithm libraries, a participant-facing web interface, logging utilities, and deployment on popular cloud and participant recruitment platforms. It empowers researchers to study a wide variety of research questions related to the interaction between humans and RL agents and aims to standardize the field of study on RL in human contexts. ...

Influence Learning in Complex Systems

Other (2025) - Elena Congeduti, Roberto Rocchetta, Frans A. Oliehoek

High sample complexity hampers the successful application of reinforcement learning methods, especially in real-world problems where simulating complex dynamics is computationally demanding. Influence-based abstraction (IBA) was proposed to mitigate this issue by breaking down the global model of large-scale distributed systems, such as traffic control problems, into small local sub-models. Each local model includes only a few state variables and a representation of the influence exerted by the external portion of the system. This approach allows converting a complex simulator into local lightweight simulators, enabling more effective applications of planning and reinforcement learning methods. However, the effectiveness of IBA critically depends on the ability to accurately approximate the influence of each local model. While there are a few examples showing promising results in benchmark problems, the question of whether this approach is feasible in more practical scenarios remains open. In this work, we take steps towards addressing this question by conducting an extensive empirical study of learning models for influence approximations in various realistic domains, and evaluating how these models generalize over long horizons. We find that learning the influence is often a manageable learning task, even for complex and large systems. Additionally, we demonstrate the efficacy of the approximation models for long-horizon problems. By using short trajectories, we can learn accurate influence approximations for much longer horizons. ...

What model does MuZero learn?

Conference paper (2024) - Jinke He, Thomas M Moerland, Joery A de Vries, Frans A Oliehoek

Model-based reinforcement learning (MBRL) has drawn considerable interest in recent years, given its promise to improve sample efficiency. Moreover, when using deep-learned models, it is possible to learn compact and generalizable models from data. In this work, we study MuZero, a state-of-the-art deep model-based reinforcement learning algorithm that distinguishes itself from existing algorithms by learning a value-equivalent model. Despite MuZero’s success and impact in the field of MBRL, existing literature has not thoroughly addressed why MuZero performs so well in practice. Specifically, there is a lack of in-depth investigation into the value-equivalent model learned by MuZero and its effectiveness in model-based credit assignment and policy improvement, which is vital for achieving sample efficiency in MBRL. To fill this gap, we explore two fundamental questions through our empirical analysis: 1) to what extent does MuZero achieve its learning objective of a value-equivalent model, and 2) how useful are these models for policy improvement? Among various other insights, we conclude that MuZero’s learned model cannot effectively generalize to evaluate unseen policies. This limitation constrains the extent to which we can additionally improve the current policy by planning with the model. ...

Policy Space Response Oracles

A Survey

Conference paper (2024) - A. Bighashdel, Yongzhao Wang, Stephen McAleer, Rahul Savani, F.A. Oliehoek

Game theory provides a mathematical way to study the interaction between multiple decision makers. However, classical game-theoretic analysis is limited in scalability due to the large number of strategies, precluding direct application to more complex scenarios. This survey provides a comprehensive overview of a framework for large games, known as Policy Space Response Oracles (PSRO), which holds promise to improve scalability by focusing attention on sufficient subsets of strategies. We first motivate PSRO and provide historical context. We then focus on the strategy exploration problem for PSRO: the challenge of assembling effective subsets of strategies that still represent the original game well with minimum computational cost. We survey current research directions for enhancing the efficiency of PSRO, and explore the applications of PSRO across various domains. We conclude by discussing open questions and future research. ...

Teacher-apprentices RL (TARL)

Leveraging complex policy distribution through generative adversarial hypernetwork in reinforcement learning

Journal article (2023) - Shi Yuan Tang, Athirai A. Irissappane, Frans A. Oliehoek, Jie Zhang

Typically, a Reinforcement Learning (RL) algorithm focuses in learning a single deployable policy as the end product. Depending on the initialization methods and seed randomization, learning a single policy could possibly leads to convergence to different local optima across different runs, especially when the algorithm is sensitive to hyper-parameter tuning. Motivated by the capability of Generative Adversarial Networks (GANs) in learning complex data manifold, the adversarial training procedure could be utilized to learn a population of good-performing policies instead. We extend the teacher-student methodology observed in the Knowledge Distillation field in typical deep neural network prediction tasks to RL paradigm. Instead of learning a single compressed student network, an adversarially-trained generative model (hypernetwork) is learned to output network weights of a population of good-performing policy networks, representing a school of apprentices. Our proposed framework, named Teacher-Apprentices RL (TARL), is modular and could be used in conjunction with many existing RL algorithms. We illustrate the performance gain and improved robustness by combining TARL with various types of RL algorithms, including direct policy search Cross-Entropy Method, Q-learning, Actor-Critic, and policy gradient-based methods. ...

A Survey on Scenario Theory, Complexity, and Compression-Based Learning and Generalization

Journal article (2023) - Roberto Rocchetta, Alexander Mey, Frans Oliehoek

This work investigates formal generalization error bounds that apply to support vector machines (SVMs) in realizable and agnostic learning problems. We focus on recently observed parallels between probably approximately correct (PAC)-learning bounds, such as compression and complexity-based bounds, and novel error guarantees derived within scenario theory. Scenario theory provides nonasymptotic and distributional-free error bounds for models trained by solving data-driven decision-making problems. Relevant theorems and assumptions are reviewed and discussed. We propose a numerical comparison of the tightness and effectiveness of theoretical error bounds for support vector classifiers trained on several randomized experiments from 13 real-life problems. This analysis allows for a fair comparison of different approaches from both conceptual and experimental standpoints. Based on the numerical results, we argue that the error guarantees derived from scenario theory are often tighter for realizable problems and always yield informative results, i.e., probability bounds tighter than a vacuous [0, 1] interval. This work promotes scenario theory as an alternative tool for model selection, structural-risk minimization, and generalization error analysis of SVMs. In this way, we hope to bring the communities of scenario and statistical learning theory closer, so that they can benefit from each other's insights. ...

Safe Multi-agent Learning via Trapping Regions

Conference paper (2023) - Aleksander Czechowski, Frans A. Oliehoek

One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms, as, in general, a collection of individual, self-serving agents is not guaranteed to converge with their joint policy, when learning concurrently. This is in stark contrast to most single-agent environments, and sets a prohibitive barrier for deployment in practical applications, as it induces uncertainty in long term behavior of the system. In this work, we apply the concept of trapping regions, known from qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. We propose a binary partitioning algorithm for verification that candidate sets form trapping regions in systems with known learning dynamics, and a heuristic sampling algorithm for scenarios where learning dynamics are not known. We demonstrate the applications to a regularized version of Dirac Generative Adversarial Network, a four-intersection traffic control scenario run in a state of the art open-source microscopic traffic simulator SUMO, and a mathematical model of economic competition. ...

What Lies beyond the Pareto Front? A Survey on Decision-Support Methods for Multi-Objective Optimization

Conference paper (2023) - Zuzanna Osika, Jazmin Zatarain Salazar, Diederik M. Roijers, Frans A. Oliehoek, Pradeep K. Murukannaiah

We present a review that unifies decision-support methods for exploring the solutions produced by multi-objective optimization (MOO) algorithms. As MOO is applied to solve diverse problems, approaches for analyzing the trade-offs offered by MOO algorithms are scattered across fields. We provide an overview of the advances on this topic, including methods for visualization, mining the solution set, and uncertainty exploration as well as emerging research directions, including interactivity, explainability, and ethics. We synthesize these methods drawing from different fields of research to build a unified approach, independent of the application. Our goals are to reduce the entry barrier for researchers and practitioners on using MOO algorithms and to provide novel research directions. ...

Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Preprint (2023) - M. Suau, M.T.J. Spaan, F.A. Oliehoek

Reinforcement learning agents may sometimes develop habits that are effective only when specific policies are followed. After an initial exploration phase in which agents try out different actions, they eventually converge toward a particular policy. When this occurs, the distribution of state-action trajectories becomes narrower, and agents start experiencing the same transitions again and again. At this point, spurious correlations may arise. Agents may then pick up on these correlations and learn state representations that do not generalize beyond the agent’s trajectory distribution. In this paper, we provide a mathematical characterization of this phenomenon, which we refer to as policy confounding, and show, through a series of examples, when and how it occurs in practice. ...

An Analysis of Model-Based Reinforcement Learning From Abstracted Observations

Journal article (2023) - R.A.N. Starre, M. Loog, E. Congeduti, F.A. Oliehoek

Many methods for Model-based Reinforcement learning (MBRL) in Markov decision processes (MDPs) provide guarantees for both the accuracy of the model they can deliver and the learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. Therefore, it may come as a surprise that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world). That means that, without taking this dependence into
account, results for MBRL do not directly extend to this setting. Our result shows that we can use concentration inequalities for martingales to overcome this problem. This result makes it possible to extend the guarantees of existing MBRL algorithms to the setting with abstraction. We illustrate this by combining R-MAX, a prototypical MBRL algorithm, with abstraction, thus producing the first performance guarantees for model-based ‘RL from Abstracted Observations’: model-based reinforcement learning with an abstract model. ...

Safety Guarantees in Multi-agent Learning via Trapping Regions

Journal article (2023) - Aleksander Czechowski, Frans A. Oliehoek

One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms, as, in general, a collection of individual, self-serving agents is not guaranteed to converge with their joint policy, when learning concurrently. This is in stark contrast to most single-agent environments, and sets a prohibitive barrier for deployment in practical applications, as it induces uncertainty in long term behavior of the system. In this work, we propose to apply the concept of trapping regions, known from qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. Upon verification of the direction of learning dynamics, the resulting trajectories are guaranteed not to escape such sets, during the learning process. As a result, it is ensured, that despite the uncertainty over convergence of the applied algorithms, learning will never form hazardous joint strategy combinations. ...

Multi-agent MDP homomorphic networks

Conference paper (2022) - Elise van der Pol, Herke van Hoof, Frans A. Oliehoek, Max Welling

This paper introduces Multi-Agent MDP Homomorphic Networks, a class of networks that allows distributed execution using only local information, yet is able to share experience between global symmetries in the joint state-action space of cooperative multi-agent systems. In cooperative multi-agent systems, complex symmetries arise between different configurations of the agents and their local observations. For example, consider a group of agents navigating: rotating the state globally results in a permutation of the optimal joint policy. Existing work on symmetries in single agent reinforcement learning can only be generalized to the fully centralized setting, because such approaches rely on the global symmetry in the full state-action spaces, and these can result in correspondences across agents. To encode such symmetries while still allowing distributed execution we propose a factorization that decomposes global symmetries into local transformations. Our proposed factorization allows for distributing the computation that enforces global symmetries over local agents and local interactions. We introduce a multi-agent equivariant policy network based on this factorization. We show empirically on symmetric multi-agent problems that globally symmetric distributable policies improve data efficiency compared to non-equivariant baselines. ...

Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs

Conference paper (2022) - Mustafa Mert Çelikok, Frans A. Oliehoek, Samuel Kaski

Centaurs are half-human, half-AI decision-makers where the AI's goal is to complement the human. To do so, the AI must be able to recognize the goals and constraints of the human and have the means to help them. We present a novel formulation of the interaction between the human and the AI as a sequential game where the agents are modelled using Bayesian best-response models. We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP. In our simulated experiments, we consider an instantiation of our framework for humans who are subjectively optimistic about the AI's future behaviour. Our results show that when equipped with a model of the human, the AI can infer the human's bounds and nudge them towards better decisions. We discuss ways in which the machine can learn to improve upon its own limitations as well with the help of the human. We identify a novel trade-off for centaurs in partially observable tasks: for the AI's actions to be acceptable to the human, the machine must make sure their beliefs are sufficiently aligned, but aligning beliefs might be costly. We present a preliminary theoretical analysis of this trade-off and its dependence on task structure. ...

Influence-Augmented Local Simulators

A Scalable Solution for Fast Deep RL in Large Networked Systems

Conference paper (2022) - Miguel Suau, Jinke He, Matthijs T.J. Spaan, Frans A. Oliehoek

Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitation being the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system. The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and presents several opportunities for the future. ...

Influence-aware memory architectures for deep reinforcement learning in POMDPs

Journal article (2022) - Miguel Suau , Jinke He, Elena Congeduti, Rolf Starre, Aleksander Czechowski, Frans A. Oliehoek

Due to its perceptual limitations, an agent may have too little information about the environment to act optimally. In such cases, it is important to keep track of the action-observation history to uncover hidden state information. Recent deep reinforcement learning methods use recurrent neural networks (RNN) to memorize past observations. However, these models are expensive to train and have convergence difficulties, especially when dealing with high dimensional data. In this paper, we propose influence-aware memory, a theoretically inspired memory architecture that alleviates the training difficulties by restricting the input of the recurrent layers to those variables that influence the hidden state information. Moreover, as opposed to standard RNNs, in which every piece of information used for estimating Q values is inevitably fed back into the network for the next prediction, our model allows information to flow without being necessarily stored in the RNN’s internal memory. Results indicate that, by letting the recurrent layers focus on a small fraction of the observation variables while processing the rest of the information with a feedforward neural network, we can outperform standard recurrent architectures both in training speed and policy performance. This approach also reduces runtime and obtains better scores than methods that stack multiple observations to remove partial observability. ...

Speeding up Deep Reinforcement Learning through Influence-Augmented Local Simulators

Conference paper (2022) - Miguel Suau, Jinke He, Matthijs T.J. Spaan, Frans A. Oliehoek

Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitation being the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system. The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and presents several opportunities for the future. ...

Model-Based Reinforcement Learning with State Abstraction: A Survey

Conference paper (2022) - R.A.N. Starre, M. Loog, F.A. Oliehoek

Model-based reinforcement learning methods are promising since they can increase sample efficiency while simultaneously improving generalizability. Learning can also be made more efficient through state abstraction, which delivers more compact models. Model-based reinforcement learning methods have been combined with learning abstract models to profit from both effects. We consider a wide range of state abstractions that have been covered in the literature, from straightforward state aggregation to deep learned representations, and sketch challenges that arise when combining model-based reinforcement learning with abstraction. We further show how various methods deal with these challenges and point to open questions and opportunities for further research. ...

A Cross-Field Review of State Abstraction for Markov Decision Processes

Conference paper (2022) - E. Congeduti, F.A. Oliehoek

Complex real-world systems pose a significant challenge to decision making: an agent needs to explore a large environment, deal with incomplete or noisy information, generalize the experience and learn from feedback to act optimally. These processes demand vast representation capacity, thus putting a burden on the agent’s limited computational and storage resources. State abstraction enables effective solutions by forming concise representations of the agents world. As such, it has been widely investigated by several research communities which have produced a variety of different approaches. Nonetheless, relations among them still remain unseen or roughly defined. This hampers potential applications of solution methods whose scope remains limited to the specific abstraction context for which they have been designed. To this end, the goal of this paper is to organize the developed approaches and identify connections between abstraction schemes as a fundamental step towards methods generalization. As a second contribution we discuss general abstraction properties with the aim of supporting a unified perspective for state abstraction. ...