
F.A. Oliehoek


Reinforcement learning agents are trained in well-defined environments and evaluated under the assumption that test-time conditions match those encountered during training. However, even small changes in the environment’s dynamics can degrade the policy’s performance, even mo ...
Reinforcement learning (RL) agents often achieve impressive results in simulation but can fail catastrophically when facing small deviations at deployment time. In this work, we examine the brittleness of Proximal Policy Optimization (PPO) agents when subjected to test-time obser ...
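As a concrete illustration of the kind of test-time perturbation studied above, the sketch below evaluates an already trained policy under Gaussian observation noise; the Gymnasium-style wrapper, the policy callable, and the noise scale are illustrative assumptions, not the setup of the work itself.

import numpy as np
import gymnasium as gym

class GaussianObsNoise(gym.ObservationWrapper):
    """Adds zero-mean Gaussian noise to every observation at evaluation time."""
    def __init__(self, env, sigma=0.05):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=np.shape(obs))

def evaluate(policy, env, episodes=10):
    """Average return of a fixed, already trained policy on the wrapped environment."""
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return float(np.mean(returns))

# e.g. evaluate(trained_ppo_policy, GaussianObsNoise(gym.make("Hopper-v4"), sigma=0.1))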
Reinforcement Learning (RL) has shown strong potential in complex decision-making domains, but its vulnerability to distributional shifts between training and deployment environments remains a significant barrier to real-world reliability, particularly in safety-critical contexts su ...

Evaluating the robustness of DQN and QR-DQN under domain randomization

Analyzing the effects of domain variation on value-based methods

Domain randomization (or DR) is a widely used technique in reinforcement learning to improve robustness and enable sim-to-real transfer. While prior work has focused extensively on DR in combination with algorithms such as PPO and SAC, its effects on value-based methods like DQN ...
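To make the technique concrete, here is a minimal sketch of domain randomization as a reset-time wrapper; the Gymnasium-style interface and the hypothetical set_dynamics hook into the simulator are assumptions for illustration, not the evaluation setup of the work above.

import numpy as np
import gymnasium as gym

class DomainRandomization(gym.Wrapper):
    """Resamples dynamics parameters (e.g. mass, friction) on every episode reset."""
    def __init__(self, env, ranges):
        super().__init__(env)
        self.ranges = ranges  # e.g. {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}

    def reset(self, **kwargs):
        params = {name: np.random.uniform(lo, hi)
                  for name, (lo, hi) in self.ranges.items()}
        self.env.unwrapped.set_dynamics(**params)  # hypothetical hook into the simulator
        return self.env.reset(**kwargs)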

Action Sampling Strategies in Sampled MuZero for Continuous Control

A JAX-Based Implementation with Evaluation of Sampling Distributions and Progressive Widening

This work investigates the impact of action sampling strategies on the performance of Sampled MuZero, a reinforcement learning algorithm designed for continuous control settings like robotics. In contrast to discrete domains, continuous action spaces require sampling from a propo ...
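The two candidate-action schemes compared in that work can be sketched as follows (plain NumPy rather than the thesis's JAX code; the Gaussian proposal and the widening constants are illustrative assumptions):

import numpy as np

def sample_candidate_actions(prior_mean, prior_std, k, rng):
    """Sampled MuZero style: draw K candidate actions from the learned policy prior."""
    return rng.normal(prior_mean, prior_std, size=(k, len(prior_mean)))

def allow_new_child(num_children, node_visits, c=1.0, alpha=0.5):
    """Progressive widening: admit another candidate once the node has enough visits."""
    return num_children < c * max(node_visits, 1) ** alpha

rng = np.random.default_rng(0)
candidates = sample_candidate_actions(np.zeros(3), np.ones(3), k=8, rng=rng)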
Planning agents have demonstrated superhuman performance in deterministic environments, such as chess and Go, by combining end-to-end reinforcement learning with powerful tree-based search algorithms. To extend such agents to stochastic or partially observable domains, Stochastic ...
A key advancement in model-based Reinforcement Learning (RL) stems from Transformer-based world models, which allow agents to plan effectively by learning an internal representation of the environment. However, causal self-attention in Transformers can be computatio ...
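For reference, the causal self-attention mentioned above can be sketched in a few lines of PyTorch; the tensor shapes are arbitrary and the snippet is illustrative only, not the thesis's model.

import torch

T, d = 6, 16                                   # sequence length, model width
q = k = v = torch.randn(1, T, d)               # latent tokens from the world model
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = (q @ k.transpose(-2, -1)) / d ** 0.5  # (1, T, T) score matrix, quadratic in T
scores = scores.masked_fill(mask, float("-inf"))
out = torch.softmax(scores, dim=-1) @ v        # each step attends only to earlier steps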
Recent advances in reinforcement learning (RL) have achieved superhuman performance in various domains but often rely on vast numbers of environment interactions, limiting their practicality in real-world scenarios. MuZero is an RL algorithm that uses Monte Carlo Tree Search with ...
This thesis investigates the role of learned abstract models in online planning and model-based reinforcement learning (MBRL). We explore how abstract models can accelerate search in online planning and evaluate their effectiveness in supporting policy evaluation and improvement ...
This study explores the application of risk-sensitive Reinforcement Learning (RL) in portfolio optimization, aiming to integrate asset pricing and portfolio construction into a unified, end-to-end RL framework. While RL has shown promise in various domains, its traditional risk-n ...

Influence-Based Multi-Agent Reinforcement Learning for Active Wake Control

Using influence to increase energy production with multi-agent reinforcement learning


The increasing demand for electricity has led to a need for more efficient energy production. One promising option is wind power, which currently provides an estimated 7.8% of the world’s energy production. One of the problems with wind energy is that a small percentage of ...
Off-policy evaluation suffers from several key problems, one of which is the “curse of horizon”. With recent breakthroughs [1] [2], new estimators have emerged that apply importance sampling to individual state-action pairs and rewards rather than to whole trajectories. With t ...
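In standard notation (not necessarily that of [1] [2]), the contrast is between the trajectory-wise importance sampling estimator, whose correction is a product over the whole horizon, and per-step estimators that reweight individual transitions by a stationary distribution ratio; schematically,

\[
\hat V_{\mathrm{traj}}(\pi) = \frac{1}{n}\sum_{i=1}^{n}\Bigg(\prod_{t=0}^{H-1}\frac{\pi(a^{(i)}_t \mid s^{(i)}_t)}{\mu(a^{(i)}_t \mid s^{(i)}_t)}\Bigg)\sum_{t=0}^{H-1}\gamma^{t} r^{(i)}_t,
\qquad
\hat V_{\mathrm{step}}(\pi) = \frac{1}{1-\gamma}\,\frac{1}{N}\sum_{j=1}^{N} \frac{d^{\pi}(s_j,a_j)}{d^{\mu}(s_j,a_j)}\, r_j,
\]

where \mu is the behaviour policy, d^\pi is the normalized discounted state-action occupancy of the target policy \pi, and the second sum runs over individual transitions rather than whole trajectories. The variance of the product of ratios in the first estimator can grow exponentially with the horizon H, which is the “curse of horizon” the per-step estimators avoid.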
Behavior-agnostic reinforcement learning is a rapidly expanding research area focusing on developing algorithms capable of learning effective policies without explicit knowledge of the environment's dynamics or specific behavior policies. It proposes robust techniques to perform ...
In the field of reinforcement learning (RL), effectively leveraging behavior-agnostic data to train and evaluate policies without explicit knowledge of the behavior policies that generated the data is a significant challenge. This research investigates the impact of state visitat ...
This paper addresses the issue of double-dipping in off-policy evaluation (OPE) in behaviour-agnostic reinforcement learning, where the same dataset is used for both training and estimation, leading to overfitting and inflated performance metrics, especially for variance. We intro ...
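A minimal sketch of the split-sample idea behind avoiding double-dipping is given below; the function arguments, the 50/50 split, and the single split (rather than full cross-fitting) are assumptions for illustration, not the procedure introduced in the paper.

import numpy as np

def split_sample_ope(transitions, fit_estimator, estimate_value, seed=0):
    """Fit the OPE estimator on one half of the data, compute the estimate on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    half = len(transitions) // 2
    fit_split = [transitions[i] for i in idx[:half]]
    eval_split = [transitions[i] for i in idx[half:]]
    model = fit_estimator(fit_split)          # e.g. learn density ratios or a critic
    return estimate_value(model, eval_split)  # evaluate only on held-out transitions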
In offline reinforcement learning, deriving a policy from a pre-collected set of experiences is challenging due to the limited sample size and the mismatched state-action distribution between the target policy and the behavioral policy that generated the data. Learning a dynamic ...
Traditionally, Recurrent Neural Networks (RNNs) have been used to predict the sequential dynamics of the environment. With recent advances in Transformer models, improvements have been demonstrated in the performance and sample efficiency of Transformers as worl ...

Understanding the Effects of Discrete Representations in Model-Based Reinforcement Learning

An analysis on the effects of categorical latent space world models on the MinAtar Environment

While model-free reinforcement learning (MFRL) approaches have been shown effective at solving a diverse range of environments, recent developments in model-based reinforcement learning (MBRL) have shown that it is possible to leverage its increased sample efficiency and generali ...
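As a concrete reference for the categorical latent spaces discussed above, here is a minimal PyTorch sketch of a categorical bottleneck with straight-through gradients; the group and class sizes loosely follow DreamerV2-style conventions and are assumptions, not the thesis's architecture.

import torch
import torch.nn.functional as F

def categorical_latent(logits):
    """logits: (batch, groups, classes) -> one-hot sample with straight-through gradients."""
    probs = F.softmax(logits, dim=-1)
    sample = torch.distributions.OneHotCategorical(probs=probs).sample()
    # forward pass uses the discrete sample, backward pass flows through the probabilities
    return sample + probs - probs.detach()

z = categorical_latent(torch.randn(8, 32, 32))  # 8 states, 32 groups of 32 classes each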
Real-world environments require robots to continuously acquire new skills while retaining previously learned abilities, all without the need for clearly defined task boundaries. Storing all past data to prevent forgetting is impractical due to storage and privacy concerns. To a ...
Reinforcement learning techniques have demonstrated great promise in tackling sequential decision-making problems. However, the inherent complexity of real-world scenarios presents significant challenges for its application. This thesis takes a fresh approach that explores the un ...