Mixed-Fidelity Reinforcement Learning for Aircraft Conflict-Resolution

Conference Paper (2025)

Author(s)

A. Moec (TU Delft - Aerospace Engineering)

D. J. Groot (TU Delft - Aerospace Engineering)

J. Ellerbroek (TU Delft - Aerospace Engineering)

Research Group

Operations & Environment

Artificial Intelligence; Air Traffic Management (ATM); Aircraft Conflict-Resolution; BlueSky Simulator; High-Fidelity Simulation; Mixed-Fidelity Reinforcement Learning (MiFi RL); Soft-Actor-Critic (SAC)

To reference this document use

https://resolver.tudelft.nl/uuid:4a18ef1f-b276-49fb-b0ff-dbf582da79e9

Publication Year

2025

Language

English

Event

15th SESAR Innovation Days, SIDs 2025 (2025-12-01 - 2025-12-04), Bled, Slovenia

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The growing density of civil air traffic is tightening operational safety margins and motivating the search for data-driven conflict-resolution policies. However, the rising compute demand of training AI models conflicts with the need to minimize their environmental impact. To reduce this impact, this paper investigates mixed-fidelity reinforcement learning (MiFi RL) as an alternative to training exclusively in high-fidelity (HiFi) simulators: the agent is first pre-trained in a computationally lightweight low-fidelity (LoFi) environment and then fine-tuned in HiFi. We analyze this paradigm across five single-agent algorithms (A2C, PPO, DDPG, SAC, and TD3) using a fixed training budget of 3 million timesteps. Off-policy methods yield a large curriculum benefit: with a 60% LoFi / 40% HiFi split, SAC achieves a 24% increase in evaluated HiFi reward and a 20% reduction in wall-clock training time relative to pure-HiFi training; DDPG attains corresponding gains of 37% and 16% at a 40% LoFi share. In contrast, the on-policy algorithms exhibit negligible or negative improvements, which may point to the replay buffer’s role in mitigating the domain shift between simulators. An efficient curriculum setup can thus alleviate computational load and environmental impact while improving final policy performance.
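The two-stage curriculum described in the abstract can be illustrated with a minimal sketch that splits a fixed timestep budget between a LoFi and a HiFi phase. This is not the authors' implementation: the `Agent` interface (`set_env`, `learn`), the function names, and the environment objects are hypothetical, chosen only to show the budget arithmetic and the order of the phases.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class Agent(Protocol):
    """Hypothetical minimal agent interface assumed by this sketch."""

    def set_env(self, env: Any) -> None: ...

    def learn(self, total_timesteps: int) -> Any: ...


@dataclass(frozen=True)
class BudgetSplit:
    lofi_steps: int
    hifi_steps: int


def split_budget(total_steps: int, lofi_share: float) -> BudgetSplit:
    """Divide a fixed timestep budget between LoFi pre-training and HiFi fine-tuning."""
    if total_steps < 0:
        raise ValueError("total_steps must be non-negative")
    if not 0.0 <= lofi_share <= 1.0:
        raise ValueError("lofi_share must lie in [0, 1]")
    lofi_steps = round(total_steps * lofi_share)
    # The HiFi phase receives the remainder so the two phases always sum to the budget.
    return BudgetSplit(lofi_steps, total_steps - lofi_steps)


def train_mixed_fidelity(
    agent: Agent,
    lofi_env: Any,
    hifi_env: Any,
    total_steps: int,
    lofi_share: float,
) -> BudgetSplit:
    """Pre-train in the LoFi environment, then fine-tune the same agent in HiFi."""
    split = split_budget(total_steps, lofi_share)
    # The same agent object is reused across both phases, so whatever state it
    # holds (weights, and a replay buffer for off-policy methods) carries over.
    for env, steps in ((lofi_env, split.lofi_steps), (hifi_env, split.hifi_steps)):
        if steps > 0:
            agent.set_env(env)
            agent.learn(steps)
    return split
```

With the budget and split reported for SAC, `split_budget(3_000_000, 0.6)` assigns 1.8 million timesteps to LoFi pre-training and 1.2 million to HiFi fine-tuning; a `lofi_share` of 0.0 reduces to the pure-HiFi baseline.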

Files

SIDs_2025_paper_17-final.pdf

(pdf | 5.4 Mb)