A Cross-Layer Fault Atlas for Cloud-Native 5G Core Networks
Mapping fault classes to observability signals across infrastructure, orchestration, and application layers
B. Bonev (TU Delft - Electrical Engineering, Mathematics and Computer Science)
S.M.B.S. Samarakoon Mudiyanselage – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Nitinder Mohan – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jérémie Decouchant – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
5G core networks are moving from monolithic applications to containerised microservices on Kubernetes. This brings flexibility and scalability, but it also makes faults harder to detect, because a single fault can surface simultaneously in the physical infrastructure, the Kubernetes orchestration layer, and the network functions. There is little empirical evidence on which telemetry signals reveal which kinds of fault. This paper presents a fault atlas for a cloud-native 5G core: an empirical mapping from 22 injected faults in eight classes (including resource stress, pod crashes, network degradation, attacks on the Packet Forwarding Control Protocol (PFCP), and dependency failures) to the 40 observability signals that detect them. The signals are drawn from four telemetry modalities: metrics (including user-plane round-trip time), logs, traces, and Kubernetes events. On our setup, all but one of the 22 faults were detectable in more than one architecture layer, and within each fault class the layers reacted in a stable order. The orchestration layer, typically the most closely watched, detected only 10 of the 22 faults and missed all CPU-stress, network-delay, and PFCP-attack faults. By contrast, 21 of the 22 faults were visible in more than one of the four telemetry modalities. The atlas is robust to the choice of statistical methodology (at least 95.9% of signals were unchanged under threshold and detector variations), and a second, independent run agreed on 93.1% of its cells, with the differences confined to near-threshold signals. The atlas and the pipeline that generated it are released as a reusable ground truth for research on fault detection and cross-layer diagnosis.
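To make the notions of an atlas "cell" and of run-to-run agreement concrete, the following is a minimal sketch in Python. It is not the released pipeline: the representation (a dictionary keyed by fault and signal), the function names, and all fault and signal names are hypothetical and chosen only for illustration, and the toy data do not reproduce any result reported above.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: one boolean per (fault, signal) cell,
# True if the signal detected the fault in that run.
Atlas = Dict[Tuple[str, str], bool]


def cell_agreement(run_a: Atlas, run_b: Atlas) -> float:
    """Fraction of cells present in both runs that carry the same verdict."""
    shared = run_a.keys() & run_b.keys()
    if not shared:
        raise ValueError("the two runs share no cells")
    same = sum(run_a[cell] == run_b[cell] for cell in shared)
    return same / len(shared)


def layers_detecting(atlas: Atlas, signal_layer: Dict[str, str], fault: str) -> Set[str]:
    """Architecture layers in which at least one signal detected the fault."""
    return {
        signal_layer[signal]
        for (f, signal), detected in atlas.items()
        if f == fault and detected
    }


# Toy data (illustrative names only, not taken from the atlas).
signal_layer = {
    "node_cpu_util": "infrastructure",
    "pod_restart_event": "orchestration",
}
run_a: Atlas = {
    ("cpu_stress", "node_cpu_util"): True,
    ("cpu_stress", "pod_restart_event"): False,
    ("pod_crash", "pod_restart_event"): True,
    ("pod_crash", "node_cpu_util"): False,
}
# Second run differs in a single cell.
run_b: Atlas = dict(run_a)
run_b[("pod_crash", "node_cpu_util")] = True

agreement = cell_agreement(run_a, run_b)
cpu_layers = layers_detecting(run_a, signal_layer, "cpu_stress")
```

In this toy example the two runs differ in one of four cells, so `cell_agreement` yields 0.75, and the hypothetical CPU-stress fault is seen only in the infrastructure layer. Keying cells by (fault, signal) keeps the per-layer and per-modality views as simple projections of the same table.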