Structured Command Extraction from ATC Communications Using Open and Fine-Tuned Language Models

Conference Paper (2025)

Author(s)

Ana Maria Mekerishvili (Student TU Delft)

Junzi Sun (TU Delft - Aerospace Engineering)

Patrick Jonk (Royal Netherlands Aerospace Centre)

Vincent de Vries (Royal Netherlands Aerospace Centre)

Research Group

Operations & Environment

Language model, Speech-to-text, Radiotelephony, Structured command extraction

To reference this document use

https://resolver.tudelft.nl/uuid:59ed17ce-578e-48fb-8247-de393ac30488

More Info

Publication Year

2025

Language

English

Event

15th SESAR Innovation Days, SIDs 2025 (2025-12-01 - 2025-12-04), Bled, Slovenia

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Radiotelephony remains the primary medium for pilot-controller communication, yet extracting structured information from spoken exchanges is challenging. Deep learning approaches often depend on large annotated datasets, which limits their use in data-scarce environments. This study evaluates open-source large language models (LLMs) for structured information extraction from air traffic control (ATC) communications, with applications in assisting or automating pseudo-pilot tasks. We evaluate Llama 3.3 (70B) with baseline prompting, and Gemma 3 (4B) in baseline and fine-tuned variants, on 496 utterances from NARSIM (NLR ATC Real-time Simulator), NLR’s air traffic management (ATM) simulator. Performance is assessed on human transcripts and on automatic speech recognition (ASR) outputs from Whisper models, with varying prompt contexts. Cross-sector generalization is tested across two ATC sectors. Using manual scoring, Llama 3.3 achieves a micro-F1 of 0.95 on human transcripts and 0.86 on outputs from a fine-tuned Whisper model. Gemma 3 performs worse in its baseline form, but fine-tuning on a small sample yields notable improvements. The results demonstrate the potential of LLMs for ATC applications without the need for large annotated datasets.
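The micro-F1 scores reported in the abstract pool true positives, false positives, and false negatives over all utterances before computing precision and recall. The following is a minimal sketch of that aggregation, not the paper's scoring procedure: the abstract states that scoring was manual, whereas this sketch assumes exact matching of (field, value) pairs. The function name `micro_f1`, the field names, and the callsigns are all hypothetical.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-utterance lists of (field, value) pairs.

    Assumption: an extracted item counts as correct only on an exact match
    with a gold item. The paper itself reports manual scoring.
    """
    tp = fp = fn = 0
    for gold_items, pred_items in zip(gold, predicted):
        gold_count, pred_count = Counter(gold_items), Counter(pred_items)
        # Multiset intersection: items present in both gold and prediction.
        overlap = sum((gold_count & pred_count).values())
        tp += overlap
        fp += sum(pred_count.values()) - overlap
        fn += sum(gold_count.values()) - overlap
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold annotations and model outputs for two utterances.
gold = [
    [("callsign", "KLM123"), ("altitude", "FL100")],
    [("callsign", "TRA45"), ("heading", "270")],
]
predicted = [
    [("callsign", "KLM123"), ("altitude", "FL100")],
    [("callsign", "TRA45"), ("heading", "280")],  # wrong heading value
]

score = micro_f1(gold, predicted)
print(f"micro-F1 = {score:.2f}")
```

Because the counts are pooled before averaging, utterances containing more commands contribute more to the score than short ones, which is the usual reason to choose micro over macro averaging for this kind of extraction task.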

Files

SIDs_2025_paper_50-final.pdf

(pdf | 2.46 MB)