D.J. Groot | TU Delft Repository

Centralized Landing Flow Merging for Drones Using Deep Reinforcement Learning

Journal article (2026) - A. Vlaskin, D.J. Groot, Emmanuel Sunil, Joost Ellerbroek, J.M. Hoekstra, Dennis Nieuwenhuisen

Drones are expected to support applications such as emergency response, parcel delivery, and infrastructure monitoring in dense urban airspaces, creating traffic levels that are unmanageable for human operators. Autonomous separation management is therefore essential, combining strategic and tactical control to prevent conflicts. This paper addresses the tactical landing phase by introducing a centralized landing flow manager—a reinforcement learning (RL) agent that adjusts drone speed and heading to merge landing flows safely and efficiently prior to a final approach fix. The objective of the work was to demonstrate the potential of reinforcement learning in this novel context, by implementing and evaluating it in simulation and testing its capabilities with 10 concurrent landing drones. The RL agent learns to successfully separate traffic, thereby lowering intrusion counts compared to the baseline autopilot, but is outperformed in safety by the decentralized Modified Voltage Potential (MVP) method due to outlier scenarios. Nevertheless, the RL-based system achieves faster scenario completion and thus a higher overall throughput, by speeding up the vehicles towards the final approach fix. Future work will explore improved network architectures, transfer learning across varied scenarios, and algorithmic fine-tuning to further enhance safety performance. ...

Mixed-Fidelity Reinforcement Learning for Aircraft Conflict-Resolution

Conference paper (2025) - A. Moëc, D. J. Groot, J. Ellerbroek

The growing density of civil air traffic is tightening operational safety margins and motivating the search for data-driven conflict-resolution policies. However, the rising compute demand for the training of AI models collides with the need to minimize its environmental impact. In an effort to reduce this climate impact, this paper investigates mixed-fidelity reinforcement learning (MiFi RL) as an alternative to training in high-fidelity (HiFi) simulators only, by first pre-training in a computationally lightweight low-fidelity (LoFi) environment before fine-tuning in HiFi. We analyze this paradigm across five single-agent algorithms – A2C, PPO, DDPG, SAC, and TD3 – using a fixed training budget of 3 million timesteps. Off-policy methods yield a large curriculum benefit: with a 60% LoFi / 40% HiFi split, SAC achieves a 24% increase in evaluated HiFi reward and a 20% reduction in wall-clock training time relative to pure-HiFi training; DDPG attains gains of 37% and 16% at a 40% LoFi share. In contrast, the on-policy algorithms exhibit negligible or negative improvements, possibly underscoring the replay buffer’s role in mitigating the domain shift between simulators. Efficient curriculum setup can alleviate computational load and environmental impact while improving final policy performance. ...
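As a rough illustration of the budget split described in this abstract, the sketch below divides the fixed 3 million timestep budget between the low-fidelity and high-fidelity phases. The helper name `split_budget` is hypothetical and not taken from the paper; only the budget and the two LoFi shares come from the abstract.

```python
def split_budget(total_timesteps, lofi_share):
    """Return (LoFi timesteps, HiFi timesteps) for a given LoFi share of the budget."""
    lofi = round(total_timesteps * lofi_share)
    return lofi, total_timesteps - lofi

TOTAL_TIMESTEPS = 3_000_000  # fixed training budget stated in the abstract

# 60% LoFi / 40% HiFi split reported for SAC
sac_lofi, sac_hifi = split_budget(TOTAL_TIMESTEPS, 0.60)

# 40% LoFi share reported for DDPG
ddpg_lofi, ddpg_hifi = split_budget(TOTAL_TIMESTEPS, 0.40)

print(f"SAC:  {sac_lofi} LoFi steps, then {sac_hifi} HiFi steps")
print(f"DDPG: {ddpg_lofi} LoFi steps, then {ddpg_hifi} HiFi steps")
```

Under these shares, the SAC configuration would spend 1.8 million timesteps in pre-training before 1.2 million timesteps of fine-tuning, and the DDPG configuration the reverse.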

Comparing attention-based methods with long short-term memory for state encoding in reinforcement learning-based separation management

Journal article (2025) - D. J. Groot, J. Ellerbroek, J. M. Hoekstra

Reinforcement learning (RL) is a method that has been studied extensively for the task of conflict-resolution and separation management within air traffic control, offering advantages over analytical methods. One key challenge associated with RL for this task is the construction of the input vector. Because the number of agents in the airspace varies, methods that can handle a dynamic number of agents are required. Various methods exist, for example, selecting a fixed number of aircraft, or using methods such as recurrent neural networks or attention to encode the information. Multiple studies have shown promising results using these encoder methods; however, studies comparing these methods are limited and the results remain inconclusive on which method works better. To address this issue, this paper compares different input encoding methods: three different attention methods – scaled dot-product, additive and context-aware attention – and long short-term memory (LSTM) with three different sorting strategies. These methods are used as input encoders for different models trained with the Soft Actor–Critic algorithm for separation management in high traffic density scenarios. It is found that additive attention is the most effective at increasing the total safety and maximizing path efficiency, outperforming the commonly used scaled dot-product attention and LSTM. Additionally, it is shown that the order of the input sequence significantly impacts the performance of the LSTM-based input encoder. This is in contrast with the attention methods, which are sequence-independent and therefore do not suffer from biases introduced by the order of the input sequence. ...
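The sequence-independence of attention mentioned in this abstract can be illustrated with a minimal sketch of scaled dot-product attention pooling over a set of aircraft states. This is not the paper's encoder: the four-feature states, the use of raw states as both keys and values, and the function names are assumptions made for illustration only.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(query, keys, values):
    """Scaled dot-product attention of one query over a set of (key, value) pairs.

    Returns a fixed-size vector regardless of how many keys/values are given.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

random.seed(0)
# Hypothetical 4-feature states for 5 intruder aircraft (illustrative data only)
intruders = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
# Hypothetical query derived from the ownship state
query = [random.uniform(-1, 1) for _ in range(4)]

encoded = attention_pool(query, intruders, intruders)

# Reversing the input order leaves the encoding unchanged (up to float rounding)
shuffled = intruders[::-1]
encoded_shuffled = attention_pool(query, shuffled, shuffled)
max_diff = max(abs(a - b) for a, b in zip(encoded, encoded_shuffled))
print(f"max difference after reordering: {max_diff:.2e}")
```

Because the output is a weighted sum over the set, the same function also accepts three or thirty intruders and still returns a four-element vector. An LSTM processing the same list would instead produce an order-dependent hidden state, which is why the sorting strategy matters for that encoder.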

Analysis of the impact of traffic density on training of reinforcement learning based conflict resolution methods for drones

Journal article (2024) - D. J. Groot, J. Ellerbroek, J. M. Hoekstra

Conventional Air Traffic Control is still predominantly performed by human Air Traffic Controllers; however, as the traffic density increases, the workload of the controllers increases as well. Especially for the area of unmanned aviation, driven by the rise in drones, having human controllers might become infeasible. One of the methods that is currently being investigated for replacing the conflict resolution task of Air Traffic Control is Reinforcement Learning. As violation of the required separation margins, also called an intrusion, is an event of relatively low frequency, using Reinforcement Learning for this task comes with difficulties that can potentially be attributed to data imbalance. This paper artificially increased the traffic density during the training phase of the Reinforcement Learning method to investigate the importance of a balanced data set for the performance of the Reinforcement Learning method. It was found that as the traffic density increased, the Reinforcement Learning methods started to outperform the analytical methods. Beyond this, it was found that methods trained at higher traffic densities, but tested at lower traffic densities, outperformed the methods trained at that specific density. This indicates that it might be better to always ensure that the training scenarios are more complex than anticipated during the execution phase, even if that results in unrealistic scenarios. ...

BlueSky-Gym: Reinforcement Learning Environments for Air Traffic Applications

Conference paper (2024) - D.J. Groot, G. Leto, A. Vlaskin, A.A.G. Moëc, Joost Ellerbroek

Reinforcement Learning (RL) is rapidly becoming a mainstay research direction within Air Traffic Management and Control (ATM/ATC). Many international consortia and individual works have explored its applicability to different ATC and U-Space / Unmanned Aircraft System Traffic Management (UTM) tasks, such as merging traffic flows, with varying levels of success. However, to date there is no common basis on which these RL techniques are compared, with many research parties building their own simulator and scenarios from scratch. This can diminish the value of this research, as the performance of an algorithm cannot be easily verified, or compared to that of other implementations. This hampers development in the long run. The Gymnasium library shows for other research domains that this can be solved by providing a set of standardised environments, which can be used to test different algorithms, and compare them to benchmark results. This paper proposes BlueSky-Gym: a library that provides a similar set of test environments for the aviation domain, building on the existing open-source air traffic simulator BlueSky. The current BlueSky-Gym environments range from vertical descent environments, to static obstacle avoidance and traffic flow merging. Built upon the Gymnasium API and the BlueSky air traffic simulator, it delivers an open-source solution for the ATC-specific RL performance benchmark. In the initial release of BlueSky-Gym, 7 functional environments are presented. Preliminary experiments with PPO, SAC, DDPG and TD3 are presented in this paper. Results show stable training is obtained on all of the environments with the default hyperparameters. On some environments, there is a large performance gap, with the on-policy PPO often trailing, but overall there is no clear algorithm that outperforms the others across the board in terms of total reward. ...

Policy Analysis of Safe Vertical Manoeuvring using Reinforcement Learning: Identifying when to Act and when to stay Idle

Conference paper (2023) - D.J. Groot, M.J. Ribeiro, Joost Ellerbroek, J.M. Hoekstra

The number of unmanned aircraft operating in the airspace is expected to grow exponentially during the next decades. This will likely lead to traffic densities that are higher than those currently observed in civil and general aviation, and might require both a different airspace structure compared to conventional aviation, as well as different conflict resolution methods. One of the main disadvantages of analytical conflict resolution methods, in high-traffic density scenarios, is that they can cause instabilities of the airspace due to a domino effect of secondary conflicts. Therefore, many studies have also investigated other methods of conflict resolution, such as Deep Reinforcement Learning, which have shown positive results, but tend to be hard to explain due to their black-box nature. This paper investigates if it is possible to explain the behaviour of a Soft Actor-Critic model, trained for resolving vertical conflicts in a layered urban airspace, by interpreting the policy through a heat map of the selected actions. It was found that the model actively changes its policy depending on the degrees of freedom and has a tendency to adopt preventive behaviour on top of conflict resolution. This behaviour can be directly linked to a decrease in secondary conflicts when compared to analytical methods and can potentially be incorporated into these methods to improve them while maintaining explainability. ...

Enabling Safe and Efficient Separation through Multi-Agent Reinforcement Learning

Conference paper (2022) - D.J. Groot, Joost Ellerbroek, J.M. Hoekstra

Over the next decades, it is expected that the number of unmanned aerial vehicles (UAVs) operating in the airspace will grow rapidly. Both the FAA (Federal Aviation Administration) and the ICAO (International Civil Aviation Organisation) have already stated that aircraft operating autonomously or beyond their operators’ line of sight are required to have detect and avoid capabilities. At higher traffic densities these avoidance manoeuvres can, however, lead to instabilities within the airspace, causing emergent patterns that lead to knock-on effects that can harm the safety of the operations. It might be possible to formulate a cost function that encapsulates global safety, rather than individual safety, stimulating both safety and stability. One method that lends itself to optimizing such a cost function is cooperative Multi-Agent Reinforcement Learning (MARL). It has been demonstrated that MARL can be used for optimization in both competitive and cooperative (or even mixed) environments; however, when applied in a completely decentralized manner, stability issues often arise. It is therefore proposed to investigate the application of MARL for a well-known centralized domain: ATC for manned aviation. This doctoral paper breaks down the proposed research project into 4 independent phases that individually contribute to the knowledge of applying MARL for ATC. ...

Using Relative State Transformer Models for Multi-Agent Reinforcement Learning in Air Traffic Control

Conference paper (2022) - D.J. Groot, Joost Ellerbroek, J.M. Hoekstra

Deep Reinforcement Learning has seen more usage in the field of Air Traffic Control over the last couple of years. As the number of aircraft in a given sector of airspace is not constant, there is a need for methods to be invariant to the number of agents in the system. Often this is done by making a selection of the aircraft that will be included in the state, which introduces human biases. Another option that has been used is Recurrent Neural Networks to process the entire sequence of aircraft present. These methods however are sequence-dependent and can give different results depending on the order that the aircraft are given, which is undesirable. Methods that solely rely on attention mechanisms, such as transformers, allow sequential data to be processed in a sequence-invariant manner by using multi-head attention mechanisms. However, because traditional Transformers operate on individual tokens, this does not allow for relative state information to be encoded into the hidden state. This paper shows that by performing a transformation operation on the key and value tokens, it is possible to use Transformers on relative states, at the cost of a factor (N-1) additional attention computations, where N is the number of agents in the system. This adaptation allows relative state Transformers to obtain significantly higher performance than standard Transformers. The results also showed that using attention mechanisms to construct the initial observation vector out of a total of 20 agents results in similar, but slightly lower, performance to handcrafted observation vectors, without requiring manual selection of the important agents. Future research should investigate whether additional changes to the attention mechanisms and their training can result in higher performance. ...

Lateral and Vertical Air Traffic Control Under Uncertainty Using Reinforcement Learning

Conference paper (2022) - C. Badea, D.J. Groot, A. Morfin Veytia, M.J. Ribeiro, Ramon Dalmau, J. Ellerbroek, J.M. Hoekstra

Air traffic demand has increased at an unprecedented rate in the last decade (albeit interrupted by the COVID pandemic), but capacity has not increased at the same rate. Higher levels of automation and the implementation of decision-support tools for air traffic controllers could help increase capacity and catch up with demand. The air traffic control problem can be effectively modelled as a Markov game, where a team of aircraft (the agents) interact in the airspace (the environment) and cooperatively take resolution actions to achieve a common goal: safe separation in the most efficient way. As in any Markov game, the optimal policy for the team could be learnt through trial and error in a simulated environment using reinforcement learning algorithms. In this paper, we use the soft actor-critic algorithm to unravel the optimal air traffic control policy. Unlike some previous works, we propose a global (i.e., shared) reward that encourages cooperative behaviour. Furthermore, we propose a versatile policy model capable of performing heading, speed, and/or altitude resolution actions. We also demonstrate that the policy is robust and can maintain safe separation even in the presence of uncertainty regarding aircraft position, delays in implementing resolution actions, and wind. The findings of this paper also suggest that there is still significant room for improvement when controlling three degrees of freedom at the same time. ...

Improving Safety of Vertical Manoeuvres in a Layered Airspace with Deep Reinforcement Learning

Conference paper (2022) - D.J. Groot, M.J. Ribeiro, J. Ellerbroek, J.M. Hoekstra

Current estimates show that the presence of unmanned aviation is likely to grow exponentially over the course of the next decades. Even with the more conservative estimates, these expected high traffic densities require a re-evaluation of the airspace structure to ensure safe and efficient operations. One structure that scored high on both the safety and efficiency metrics, as defined by the Metropolis project, is a layered airspace, where aircraft with an intended heading are assigned to a specific altitude layer. However, a problem arises once aircraft start to vertically traverse between these layers, leading to a large number of conflicts and intrusions. One way to potentially reduce the number of intrusions during these operations is by using conventional conflict resolution algorithms. These algorithms however have also been shown to lead to instabilities at higher traffic densities. As recent years have shown tremendous growth in the capabilities of Deep Reinforcement Learning, it is interesting to see how well these methods perform in the field of conflict resolution. This research investigates and compares the performance of multiple Soft Actor Critic models with the Modified Voltage Potential algorithm during vertical manoeuvres in a layered airspace. The final obtained performance of the trained models is comparable to that of the Modified Voltage Potential algorithm and in certain scenarios, the trained models even outperform the MVP algorithm. Overall, the results show that DRL can improve upon the current state of conflict resolution algorithms and provide new insight into the development of safe operations. ...