
J.W. Böhmer

27 records found

Over the past decade, model-based reinforcement learning (MBRL) has become a leading approach for solving complex decision-making problems. A prominent algorithm in this domain is MuZero, which integrates Monte Carlo Tree Search (MCTS) with deep neural networks and a latent world ...
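The selection rule at the heart of MuZero-style MCTS can be illustrated with a simplified pUCT score; the constant, values, and function name below are illustrative, not taken from the thesis:

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.25):
    """Simplified pUCT score used by AlphaZero/MuZero-style planners:
    exploitation term Q plus a prior-weighted exploration bonus that
    shrinks as a child is visited more often."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# a well-visited, high-value child vs an unvisited child with a strong prior
s1 = puct_score(q=0.6, prior=0.2, parent_visits=50, child_visits=10)
s2 = puct_score(q=0.0, prior=0.7, parent_visits=50, child_visits=0)
# the unexplored high-prior child gets the larger score, steering
# the search toward actions the policy network considers promising
```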

Analyzing Plasticity Through Utility Scores

Comparing Continual Learning Algorithms via Utility Score Distributions

One of the central problems in continual learning is the loss of plasticity: the model's inability to learn new tasks. Several approaches have previously been proposed, such as Continual Backpropagation (CBP). This algorithm uses utility scores, which represent how usefu ...
Continual Backpropagation (CBP) has recently been proposed as an effective method for mitigating loss of plasticity in neural networks trained in continual learning (CL) settings. While extensive experiments have been conducted to demonstrate the algorithm's ability to mitigate l ...
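The utility-score mechanism these abstracts refer to can be sketched as follows. This is a minimal illustration, not CBP's exact update: the utility definition here is a toy proxy (activation magnitude times outgoing weight magnitude), and all names and sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def cbp_reset_step(W_in, W_out, activations, replacement_rate=0.01):
    """One simplified CBP maintenance step: score each hidden unit by a
    toy utility and reinitialize the lowest-utility fraction."""
    # utility per hidden unit: |mean activation| * outgoing weight mass
    utility = np.abs(activations).mean(axis=0) * np.abs(W_out).sum(axis=1)
    n_reset = max(1, int(replacement_rate * len(utility)))
    low = np.argsort(utility)[:n_reset]          # least useful units
    W_in[:, low] = rng.normal(0, 0.1, size=(W_in.shape[0], len(low)))
    W_out[low, :] = 0.0                          # new units start inert
    return low

# toy network: 4 inputs -> 8 hidden -> 2 outputs, batch of 32
W_in = rng.normal(0, 0.1, size=(4, 8))
W_out = rng.normal(0, 0.1, size=(8, 2))
h = np.maximum(0, rng.normal(size=(32, 4)) @ W_in)   # ReLU activations
reset = cbp_reset_step(W_in, W_out, h, replacement_rate=0.25)
print(reset)  # indices of the two reinitialized hidden units
```

Zeroing the outgoing weights of reset units means they do not disturb the network's current function until gradient descent recruits them.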

Layerwise Perspective into Continual Backpropagation

Replacing the First Layer is All You Need

Continual learning faces a problem known as plasticity loss, where models gradually lose the ability to adapt to new tasks. We investigate Continual Backpropagation (CBP) – a method that tackles plasticity loss by constantly resetting a small fraction of low-utility neurons. We ...
Deep learning systems are typically trained in static environments and fail to adapt when faced with a continuous stream of new tasks. Continual learning addresses this by allowing neural networks to learn sequentially without forgetting prior knowledge. However, such models ofte ...

Maintaining Plasticity for Deep Continual Learning

Activation Function-Adapted Parameter Resetting Approaches

Standard deep learning tools, in particular feed-forward artificial neural networks and the backpropagation algorithm, fail to adapt to sequential learning scenarios, where the model is continuously presented with new training data. Many algorithms that aim to solve this probl ...
AlphaZero and its successors employ learned value and policy functions to enable more efficient and effective planning at deployment. A standard assumption is that the agent will be deployed in the same environment where these estimators were trained; changes to the environment w ...
This thesis introduces a novel sparsity-regularized transformer to be used as a world model in model-based reinforcement learning, specifically targeting environments with sparse interactions. Sparse-interactive environments are a class of environments where the state can be deco ...
The application of multi-robot systems has gained popularity in recent years. Multi-robot systems show great potential for scaling up robotic applications in surveillance, monitoring, and exploration. Although single robots can already be used to automate search and rescue, and ...
The research in this thesis falls within the realm of optimization under uncertainty, a crucial area in computer science and mathematics with broad applications in power systems, finance, machine learning, healthcare, and more. This thesis presents three main contributions across ...
In reinforcement learning, the ability to generalize to unseen situations is pivotal to an agent's success. This thesis introduces two novel methods that aim to enhance an agent's ability to generalize. Both methods rely on the idea that the diversity of a re ...
Recent advancements in differential simulators offer a promising approach to enhancing the sim2real transfer of reinforcement learning (RL) agents by enabling the computation of gradients of the simulator’s dynamics with respect to its parameters. However, the application of thes ...
Over the last decade, there have been significant advances in model-based deep reinforcement learning. One of the most successful such algorithms is AlphaZero, which combines Monte Carlo Tree Search with deep learning. AlphaZero and its successors commonly describe a unified frame ...

Reward Based Program Synthesis for Minecraft

Adapting Program Synthesizers for Reward Evaluation and Leveraging Discovered Programs

Program synthesis is the task of constructing a program that provably satisfies a given high-level specification. There are various ways in which a specification can be described. This research focuses on adapting the Probe synthesizer, traditionally reliant on input-output examples ...
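The input-output-example setting that Probe traditionally targets can be illustrated with a minimal enumerative search over a tiny arithmetic grammar; the spec, grammar, and constant ranges below are hypothetical, and this is not Probe's actual algorithm:

```python
from itertools import product

# hypothetical specification by input-output examples: f(x) = 2*x + 1
examples = [(1, 3), (2, 5), (5, 11)]

def satisfies(f):
    """A candidate program is accepted only if it matches every example."""
    return all(f(x) == y for x, y in examples)

# candidate programs: a*x + b for small integer constants a, b
for a, b in product(range(-3, 4), repeat=2):
    if satisfies(lambda x, a=a, b=b: a * x + b):
        print(f"{a}*x + {b}")   # prints "2*x + 1"
        break
```

A reward-based adaptation, as in this research, would replace the all-or-nothing `satisfies` check with a score that can rank partially correct programs.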

DMQL: Deep Maximum Q-Learning

Combatting Relative Overgeneralisation in Deep Independent Learners using Optimism and Similarity

Various pathologies can occur when independent learners are used in cooperative Multi-Agent Reinforcement Learning. One such pathology is Relative Overgeneralisation, which manifests when a suboptimal Nash Equilibrium in the joint action space of a problem is preferred over an op ...
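Relative overgeneralisation can be made concrete with the classic climbing game: each independent learner averages its returns over the partner's exploratory actions, which makes a safe but suboptimal action look best. The payoff matrix below is the standard climbing-game example, not taken from the thesis:

```python
import numpy as np

# Climbing game: the optimal joint action is (0, 0) with payoff 11,
# but miscoordination around it is heavily punished (-30).
payoff = np.array([
    [ 11, -30,   0],
    [-30,   7,   6],
    [  0,   0,   5],
])

# An independent learner facing a uniformly exploring partner sees
# these expected returns for its own three actions:
expected = payoff.mean(axis=1)
print(expected.argmax())   # prints 2: the suboptimal action wins on average
```

This is exactly the pathology: the Nash equilibrium around action 2 is preferred because averaging drowns out the high-payoff but risky optimum, which optimistic value estimates aim to counteract.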
Deep model-based reinforcement learning has shown state-of-the-art, human-exceeding performance in many challenging domains.
However, low sample efficiency and limited exploration remain leading obstacles in the field.
In this work, we incorporate epistemic uncertain ...
This paper compares the generalization capability of multi-head attention (MHA) models with that of convolutional neural networks (CNNs). This is done by comparing their performance on out-of-distribution data. The dataset that is used to train both models is created by coupling dig ...
Most deep learning models fail to generalize in production, because the data used during training does not always completely reflect the deployment environment. The test data is then considered out-of-distribution compared to the training data. In this paper, we focus on out-of-dist ...
The use of Transformers outside the realm of natural language processing is becoming more and more prevalent. On datasets such as CIFAR-100, the Transformer has already been shown to perform just as well as the much more established convolutional neural network (CNN). Th ...