D.J. Groot
10 records found
The growing density of civil air traffic is tightening operational safety margins and motivating the search for data-driven conflict-resolution policies. However, the rising compute demand of training AI models conflicts with the need to minimize the environmental impact of that training. To reduce this climate impact, this paper investigates mixed-fidelity reinforcement learning (MiFi RL) as an alternative to training exclusively in high-fidelity (HiFi) simulators: agents are first pre-trained in a computationally lightweight low-fidelity (LoFi) environment and then fine-tuned in HiFi. We analyze this paradigm across five single-agent algorithms (A2C, PPO, DDPG, SAC, and TD3) using a fixed training budget of 3 million timesteps. Off-policy methods yield a large curriculum benefit: with a 60% LoFi / 40% HiFi split, SAC achieves a 24% increase in evaluated HiFi reward and a 20% reduction in wall-clock training time relative to pure-HiFi training, while DDPG attains gains of 37% and 16%, respectively, at a 40% LoFi share. In contrast, the on-policy algorithms exhibit negligible or negative improvements, possibly underscoring the replay buffer's role in mitigating the domain shift between simulators. An efficient curriculum setup can therefore reduce computational load and environmental impact while improving final policy performance.
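The budget arithmetic behind the LoFi/HiFi splits mentioned above can be sketched as follows. This is only an illustrative sketch: the function name `split_budget` and its parameters are hypothetical and do not come from the paper, which does not describe its implementation here.

```python
from typing import Tuple


def split_budget(total_timesteps: int, lofi_share: float) -> Tuple[int, int]:
    """Split a fixed training budget into (LoFi, HiFi) timesteps.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0.0 <= lofi_share <= 1.0:
        raise ValueError("lofi_share must lie in [0, 1]")
    # Round to the nearest whole timestep; the remainder goes to HiFi
    # so that the two phases always sum to the full budget.
    lofi_steps = round(total_timesteps * lofi_share)
    return lofi_steps, total_timesteps - lofi_steps


TOTAL_TIMESTEPS = 3_000_000  # fixed budget stated in the abstract

# Pure-HiFi baseline, plus the 40% and 60% LoFi shares from the abstract.
for share in (0.0, 0.4, 0.6):
    lofi, hifi = split_budget(TOTAL_TIMESTEPS, share)
    print(f"LoFi share {share:.0%}: {lofi} LoFi steps, {hifi} HiFi steps")
```

Under this reading, a 60% LoFi share means pre-training for the first 1.8 million timesteps in the lightweight simulator and spending the remaining 1.2 million on HiFi fine-tuning.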
Reinforcement learning (RL) has been studied extensively for conflict resolution and separation management within air traffic control, where it offers advantages over analytical methods. One key challenge associated with RL for this task is the construction of the input vector. Because the number of agents in the airspace varies, methods that can handle a dynamic number of agents are required. Various approaches exist, for example selecting a fixed number of aircraft, or using recurrent neural networks or attention to encode the information. Multiple studies have shown promising results using these encoder methods. However, studies comparing them are limited, and the results remain inconclusive on which method works better. To address this issue, this paper compares different input encoding methods: three attention methods (scaled dot-product, additive, and context-aware attention) and long short-term memory (LSTM) with three different sorting strategies. These methods are used as input encoders for different models trained with the Soft Actor–Critic algorithm for separation management in high-traffic-density scenarios. It is found that additive attention is the most effective at increasing total safety and maximizing path efficiency, outperforming the commonly used scaled dot-product attention and LSTM. Additionally, it is shown that the order of the input sequence significantly impacts the performance of the LSTM-based input encoder. This is in contrast with the attention methods, which are sequence-independent and therefore do not suffer from biases introduced by the order of the input sequence.
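The order-independence of attention noted above can be illustrated with a minimal sketch of scaled dot-product attention pooling. This is a simplified assumption, not the paper's encoder: it uses the raw feature vectors directly as keys and values, with no learned projections, and the names `attention_pool`, `ownship`, and `intruders` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query: Sequence[float],
                   items: Sequence[Sequence[float]]) -> List[float]:
    """Pool a variable number of item vectors into one fixed-size vector.

    Weights are softmax(query . item / sqrt(d)); the items serve as both
    keys and values (an assumption made to keep the sketch short).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, item)) / scale
              for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    return [sum(w * item[i] for w, item in zip(weights, items))
            for i in range(dim)]


# Hypothetical 2-D features: one ownship query and three intruders.
ownship = [1.0, 0.5]
intruders = [[0.2, 0.1], [0.9, 0.4], [-0.3, 0.8]]

pooled = attention_pool(ownship, intruders)
pooled_reversed = attention_pool(ownship, list(reversed(intruders)))

# Reordering the intruders leaves the pooled vector unchanged
# (up to floating-point rounding).
same = all(math.isclose(a, b, abs_tol=1e-12)
           for a, b in zip(pooled, pooled_reversed))
print("order-independent:", same)
```

The output length depends only on the feature dimension, not on how many intruders are passed in, which is what lets such an encoder handle a varying number of aircraft. An LSTM, by contrast, consumes the same vectors one at a time, so its final state depends on the order in which they are presented.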
Conventional Air Traffic Control is still predominantly performed by human Air Traffic Controllers. However, as traffic density increases, so does the workload of the controllers. Especially in unmanned aviation, where growth is driven by the rise in drones, relying on human controllers might become infeasible. One of the methods currently being investigated for taking over the conflict resolution task of Air Traffic Control is Reinforcement Learning. Because a violation of the required separation margins, also called an intrusion, is a relatively infrequent event, using Reinforcement Learning for this task comes with difficulties that can potentially be attributed to data imbalance. This paper artificially increased the traffic density during the training phase to investigate how important a balanced data set is for the performance of the Reinforcement Learning method. It was found that as the traffic density increased, the Reinforcement Learning methods started to outperform the analytical methods. Furthermore, methods trained at higher traffic densities but tested at lower traffic densities outperformed the methods trained at the tested density. This indicates that it might be better to always ensure that the training scenarios are more complex than those anticipated during the execution phase, even if that results in unrealistic scenarios.