
J.W. Böhmer

27 records found

Motivation: Clustering is an unsupervised learning task with broad applications. Traditional clustering methods often rely on point estimates of model parameters, which can limit their ability to capture uncertainty. Bayesian clustering addresses this by incorporating unce ...
Reinforcement Learning is a powerful tool for problems that require sequential decision-making. However, it often faces challenges due to the extensive need for reward engineering. Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL) hold the ...
Reinforcement Learning from Human Feedback (RLHF) offers a powerful approach to training agents in environments where defining an explicit reward function is challenging by learning from human feedback provided in various forms. This research evaluates three common feedback types ...
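
The most common feedback format in the RLHF literature is a pairwise preference over trajectory segments. As a generic, hedged illustration of fitting a reward model to such preferences via the Bradley-Terry model (not the method evaluated in this thesis; the network sizes and data below are invented):

    # Sketch: reward learning from pairwise preferences (Bradley-Terry).
    # Illustration only, not this thesis's method. Data is synthetic.
    import torch
    import torch.nn as nn

    reward_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)

    def preference_loss(seg_a, seg_b, a_preferred):
        # Segment return = sum of per-state rewards;
        # Bradley-Terry: P(a preferred over b) = sigmoid(R(a) - R(b)).
        logit = reward_net(seg_a).sum() - reward_net(seg_b).sum()
        target = torch.tensor(1.0 if a_preferred else 0.0)
        return nn.functional.binary_cross_entropy_with_logits(logit, target)

    # One gradient step on a synthetic preference between two
    # 10-step segments of 4-dimensional states.
    seg_a, seg_b = torch.randn(10, 4), torch.randn(10, 4)
    loss = preference_loss(seg_a, seg_b, a_preferred=True)
    opt.zero_grad(); loss.backward(); opt.step()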
Reinforcement Learning from Human Feedback (RLHF) is a promising approach to training agents to perform complex tasks by incorporating human feedback. However, the quality and diversity of this feedback can significantly impact the learning process. Humans are highly diverse in t ...
The main concept behind reinforcement learning is that an agent takes actions and is rewarded or punished for them. However, the rewards involved in performing a given task can be quite complicated in real life, and the contribution of different facto ...
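
For readers new to the setup this abstract describes, the agent-environment loop looks roughly as follows; a minimal sketch using the Gymnasium API, with a random policy standing in for a learned one:

    # Minimal agent-environment loop (Gymnasium API). The random policy
    # is a placeholder, not the approach of the thesis above.
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()   # agent takes an action
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward               # reward or punishment signal
        done = terminated or truncated
    env.close()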
Proteins are fundamental biological macromolecules essential for cellular structure, enzymatic catalysis, and immune defense, making the generation of novel proteins crucial for advancements in medicine, biotechnology, and material sciences. This study explores protein design usi ...

Conflict in the World of Inverse Reinforcement Learning

Investigating Inverse Reinforcement Learning with Conflicting Demonstrations

Inverse Reinforcement Learning (IRL) algorithms are closely related to Reinforcement Learning (RL) but instead try to model the reward function from a given set of expert demonstrations. In IRL, many algorithms have been proposed, but most assume consistent demonstrations. Consis ...
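
To make the setting concrete: with a reward assumed linear in state features, a classic IRL recipe picks reward weights that separate expert trajectories from the current policy's (cf. apprenticeship learning). The sketch below shows that generic idea on synthetic data; it is not the algorithm studied in this thesis:

    # Sketch of the feature-expectation matching behind many IRL methods.
    # All trajectories here are synthetic feature vectors.
    import numpy as np

    rng = np.random.default_rng(0)

    def feature_expectations(trajectories):
        # Average feature vector over all visited states.
        return np.mean([phi for traj in trajectories for phi in traj], axis=0)

    expert_trajs = [rng.normal(1.0, 0.5, size=(20, 3)) for _ in range(5)]
    policy_trajs = [rng.normal(0.0, 0.5, size=(20, 3)) for _ in range(5)]

    # The reward direction is the gap between expert and policy feature
    # expectations; expert behaviour scores higher under it.
    w = feature_expectations(expert_trajs) - feature_expectations(policy_trajs)
    w /= np.linalg.norm(w)
    reward = lambda phi: w @ phi   # estimated reward, linear in features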

Program Synthesis from Game Rewards Using FrAngel

Finding Complex Subprograms for Solving Minecraft

Program synthesis has been extensively used for automating code-related tasks, but it has yet to be applied in the realm of reward-based games. FrAngel is a component-based program synthesizer that addresses the aspects of exploration and exploitation, both important for the perf ...

Program Synthesis from Rewards with Probe

Adjusting Probe to Increase Exploration When Synthesising Programs from Rewards in Minecraft

Program synthesis is the task of generating a program that satisfies some specification. An important aspect of program synthesis is the method of specification. There are various ways in which a desired program can be specified, such as I/O examples, traces, and natural language ...
Program synthesis remains largely unexplored in the context of playing games, where exploration and exploitation are crucial for solving tasks within complex environments. FrAngel is a program synthesis algorithm that addresses both of these aspects with its fragments used for th ...
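
Both Probe and FrAngel refine a bottom-up enumerative search. As a hedged, much-simplified illustration of synthesis from I/O examples (the grammar and examples below are invented):

    # Toy bottom-up enumerative synthesizer over a tiny arithmetic
    # grammar, checked against I/O examples. Far simpler than Probe or
    # FrAngel, which add cost guidance and fragment reuse respectively.
    import itertools

    examples = [(1, 3), (2, 5), (3, 7)]    # target behaviour: x -> 2*x + 1

    def synthesize(max_rounds=3):
        programs = {("x", lambda x: x), ("1", lambda x: 1), ("2", lambda x: 2)}
        for _ in range(max_rounds):
            new = set()
            for (sa, fa), (sb, fb) in itertools.product(programs, repeat=2):
                new.add((f"({sa} + {sb})", lambda x, fa=fa, fb=fb: fa(x) + fb(x)))
                new.add((f"({sa} * {sb})", lambda x, fa=fa, fb=fb: fa(x) * fb(x)))
            programs |= new
            for src, fn in programs:
                if all(fn(i) == o for i, o in examples):
                    return src
        return None

    print(synthesize())   # e.g. "((x + x) + 1)", equivalent to 2*x + 1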

Program Synthesis from Rewards using Probe and FrAngel

Impact of Exploration-Exploitation Configurations on Probe and FrAngel in Minecraft

Program synthesis involves finding a program that meets the user intent, typically provided as input/output examples or formal mathematical specifications. This paper explores a novel specification in program synthesis: learning from rewards. We explore existing synthesizer ...
Advancing protein design is crucial for breakthroughs in medicine and biotechnology, yet traditional approaches often fall short by focusing solely on representing protein sequences using the 20 canonical amino acids. This thesis explores discrete diffusion models for generating ...

Activity Progress Prediction

Is there progress in video progress prediction methods?

In this paper, we investigate the behaviour of current progress prediction methods on the benchmark datasets in common use. We show that these methods can fail to extract useful information from visual data on these datasets. Moreover, when the methods fail to ...
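
A useful reference point when reading this claim: a baseline that ignores visual data entirely and predicts progress from temporal position alone. A hypothetical sketch:

    # Frame-index baseline: predicts progress with no visual input,
    # only the position within the video. Hypothetical reference point,
    # not a method from the paper above.
    def frame_index_progress(frame_idx: int, num_frames: int) -> float:
        """Predicted progress in (0, 1] from temporal position alone."""
        return (frame_idx + 1) / num_frames

    print(frame_index_progress(49, 100))   # halfway through -> 0.5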
Operation and maintenance of the built environment have a major effect on socioeconomic stability and sustainability. A significant part of our built world is approaching or has well exceeded its designated structural life. As engineers, we need to find efficient ways to extend this ...
Experience replay for off-policy reinforcement learning has been shown to improve sample efficiency and stabilize training. However, typical uniformly sampled replay includes many samples that are irrelevant to the agent reaching good performance. We introduce Action Sensitive Experience ...
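
For context, the uniformly sampled replay this abstract contrasts against looks like the sketch below (not the proposed Action Sensitive variant):

    # Minimal uniform experience replay buffer: every stored transition
    # is equally likely to be sampled, relevant or not. Baseline sketch.
    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)   # oldest entries fall out

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            # Uniform sampling: no notion of how useful a transition is.
            batch = random.sample(self.buffer, batch_size)
            return tuple(zip(*batch))   # states, actions, rewards, ...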
Neural networks are commonly initialized to keep the theoretical variance of the hidden pre-activations constant, in order to avoid the vanishing and exploding gradient problem. Though this condition is necessary to train very deep networks, numerous analyses have shown that it is no ...
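
The condition the abstract refers to can be checked empirically: with He initialization (weight variance 2/fan_in) and ReLU activations, the pre-activation variance stays roughly constant across layers. A small demonstration on synthetic data:

    # Empirical check of variance-preserving (He) initialization: the
    # printed pre-activation variance stays roughly constant with depth.
    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 512, 10
    h = rng.normal(size=(1000, width))             # synthetic input batch

    for layer in range(depth):
        W = rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
        z = h @ W                                  # pre-activations
        print(f"layer {layer}: Var(z) = {z.var():.3f}")   # ~constant
        h = np.maximum(z, 0.0)                     # ReLU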
Autonomous robots have been widely applied in search and rescue missions to gather information about target locations. This process must be continuously replanned based on new observations in the environment. For dynamic targets, the robot needs not only to discover them ...
Language is an intuitive and effective way for humans to communicate. Large Language Models (LLMs) can interpret and respond well to language. However, their use in deep reinforcement learning is limited, as they are sample-inefficient. State-of-the-art deep reinforcement learning ...
Effect Handler Oriented Programming is a promising new programming paradigm, delivering separation of concerns with regard to side effects in an otherwise functional environment. This paper discusses the applicability of this new paradigm to static code analysis programs. ...