S.M.B.S. Samarakoon Mudiyanselage

5 records found

What Each Modality Reveals About Faults — and What It Misses

Cloud-native 5G Core networks emit metrics, logs, and distributed traces, yet faults are typically diagnosed within a single modality. We show that the relationships between these modalities carry fault information that single-signal analysis misses, and we use them to characterize faults in a containerized Open5GS testbed. Our method computes the change in Spearman rank correlation between cross-modal signal pairs, from a pre-fault baseline to the fault window, yielding a coupling-change metric ∆|ρ|. Across 22 operational fault scenarios over seven independent deployments, and 6 security scenarios over three, the analysis surfaces reproducible coupling signatures and, more importantly, modality blind spots that follow the 5G interface architecture: because distributed tracing instruments only the Service-Based Interface, N2 interface partitions and N4/PFCP session faults are trace-blind and characterizable only through metrics-logs coupling, whereas a valid-request NRF flood produces no error logs. No single modality covers every fault type. We treat classification as an analytical instrument rather than a goal: cross-modal coupling features are deliberately weaker classifiers than raw per-signal features, and a SHAP analysis shows the two views rely on different modality pairs, consistent with the coupling view’s value being characterization rather than accuracy. The security-fault results are preliminary, owing to the small, low-variance dataset. The contribution is a reproducible, architecture-grounded map of which modality reveals, and which is blind to, each fault. ...
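
The coupling-change metric described in this abstract can be illustrated with a minimal sketch. It assumes that ∆|ρ| is the absolute Spearman correlation in the fault window minus the absolute correlation in the pre-fault baseline; the abstract does not spell out the exact formula, so this reading, the function names, and the sample values are all hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant signal is treated as uncoupled
    return cov / sqrt(vx * vy)


def coupling_change(baseline_pair, fault_pair):
    """Assumed definition: |rho_fault| - |rho_baseline| for one signal pair."""
    return abs(spearman(*fault_pair)) - abs(spearman(*baseline_pair))


# Hypothetical metric/log pair, e.g. CPU usage vs. error-log count per interval.
baseline = ([1, 2, 3, 4], [10, 20, 30, 40])
fault = ([1, 2, 3, 4], [2, 1, 4, 3])
delta = coupling_change(baseline, fault)
```

A negative `delta` here would indicate that the two signals became less tightly coupled during the fault window than they were in the baseline.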

Modern cloud-native systems generate large amounts of telemetry data, including logs, metrics, and traces, which are useful for monitoring and diagnosing system behavior. However, the effectiveness of machine learning-based anomaly detection varies significantly depending on the telemetry modality and the nature of the faults.

With the ever-increasing demands of 5G applications and upcoming 6G systems, operators must ensure that their networks can rapidly respond to and mitigate faults, with detection being the first step. This project investigates the performance of machine learning models for anomaly detection when logs, metrics, and traces are analyzed independently, with the goal of understanding their relative strengths and limitations.

The study analyzes the performance of fifteen models across five different fault classes, comprising 22 faults in total. After an appropriate dataset was created, each model was evaluated on data collected from several runs. The results show that no single modality detects all faults, two modalities together detect all but one fault, and all three modalities together detect every fault. ...
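
The coverage comparison in this abstract (one modality, two modalities, all three) amounts to taking unions of per-modality detection sets. The sketch below shows that bookkeeping under stated assumptions: the fault names and detection sets are toy values invented for illustration and are not the study's results, and `coverage_by_combination` is a hypothetical helper.

```python
from itertools import combinations


def coverage_by_combination(detected, all_faults):
    """Map each modality combination to the set of faults it misses."""
    missed = {}
    modalities = sorted(detected)
    for size in range(1, len(modalities) + 1):
        for combo in combinations(modalities, size):
            # A fault counts as covered if any modality in the combo detects it.
            covered = set().union(*(detected[m] for m in combo))
            missed[combo] = set(all_faults) - covered
    return missed


# Toy data (hypothetical): which faults each modality detects on its own.
all_faults = {"f1", "f2", "f3", "f4"}
detected = {
    "logs": {"f1", "f2"},
    "metrics": {"f2", "f3"},
    "traces": {"f3", "f4"},
}
missed = coverage_by_combination(detected, all_faults)
```

Inspecting `missed` for combinations of size one, two, and three gives the kind of statement the abstract reports, for example which faults remain undetected by a given pair of modalities.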

Mapping fault classes to observability signals across infrastructure, orchestration, and application layers

5G core networks are moving from monolithic applications to containerised microservices on Kubernetes. This brings flexibility and scalability, but it also makes faults harder to detect, since a single fault can surface in the physical infrastructure, the Kubernetes orchestration, and the network functions at the same time. There is little empirical evidence on which telemetry signals reveal which kind of fault. This paper presents a fault atlas for a cloud-native 5G core: an empirical mapping from 22 injected faults in eight classes (resource stress, pod crashes, network degradation, attacks on the Packet Forwarding Control Protocol (PFCP), and dependency failures) to the 40 observability signals, collected from metrics (including user-plane round-trip time), logs, traces, and Kubernetes events, that detect them. On our setup, all but one of the 22 faults were detectable in more than one architecture layer, and within each fault class the layers reacted in a stable order. The orchestration layer, typically the most closely watched one, detected only 10 of the 22 faults and missed all CPU-stress, network-delay, and PFCP-attack faults, while 21 of the 22 faults were visible in more than one of the four telemetry modalities. The atlas is robust to the statistical methodology (at least 95.9% of signals unchanged under threshold and detector variations), and a second independent run agreed on 93.1% of its cells, with the differences confined to near-threshold signals. The atlas and the pipeline that generated it are released as a reusable ground truth for fault-detection and cross-layer diagnosis research. ...

Trade-off analysis in terms of CPU overhead, storage requirements, volume reduction and retained system visibility

Cloud-native 5G core networks generate large volumes of heterogeneous log data across multiple microservice components, making telemetry management a critical operational challenge. Most existing log reduction techniques have not been evaluated on 5G core logs in particular, so the best approach for reducing log volume in such a system remains unclear. This paper investigates five log reduction strategies (LogShrink, Denum, SALO, Drain, and Log Preprocessing) applied to an Open5GS deployment on a Kubernetes-in-Docker cluster under ten scenarios (steady-state, bursty traffic, and eight fault injections). The strategies are evaluated across volume reduction, CPU overhead, and five system visibility metrics.

The two strategy families (online and offline) operate on different inputs and use separate baselines, so their figures are not directly comparable. Lossless offline strategies (LogShrink and Denum) achieve 83–96% byte reduction with full visibility preservation, with Denum far more resource-efficient than LogShrink. Lossy online strategies (SALO, Log Preprocessing, Drain), on the other hand, reduce real-time log streams by 53–89% at low cluster overhead but significantly reduce fault-signal retention. No single strategy dominates all dimensions simultaneously. The study provides a framework for selecting log reduction strategies in cloud-native 5G deployments based on specific operational constraints. ...

Cloud-native 5G core networks transform network functions into containerised microservices, which simplifies their management but fragments their observability across multiple telemetry layers. Monitoring these systems requires balancing visibility against the overhead that monitoring imposes on the control plane being observed. This paper evaluates two fundamentally different collection paradigms, pull-based scraping via Prometheus and eBPF auto-instrumentation via Grafana Beyla, on a live Open5GS 5G core deployed on a three-node Kubernetes cluster. Each stack is deployed in isolation and in combination. Resource overhead is quantified across multiple granularity settings, scalability is measured by varying the number of Network Functions (NFs) monitored at a time, and fault-detection coverage is assessed over 22 injected scenarios across five fault classes, using Chaos Mesh for controlled injection. Prometheus incurs substantially higher monitoring-stack overhead; Beyla’s sampling rate has a negligible effect on cost, because kernel uprobes fire on every HTTP/2 library call regardless of the sampling decision. For fault observability, Beyla flags all 22 fault types in at least one of the three runs, while Prometheus misses only one (NRF cascade failure). However, across all three runs, only 10 of the 22 faults are detected reliably by both methods. Per-run reliability favours Prometheus (87.9% vs. 81.8%). We conclude that Beyla offers broader fault-type coverage at lower overhead, but Prometheus provides more consistent detection per individual injection. ...
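
The difference between fault-type coverage and per-run reliability in this abstract can be made concrete with a small sketch. It assumes that coverage counts a fault as detected if any run flags it, that "reliable" means flagged in every run, and that per-run reliability is the fraction of individual injections detected; the abstract does not define these terms precisely. The helper name, fault names, and detection values are hypothetical toy data.

```python
def summarize_detection(runs_by_fault):
    """Summarize detection outcomes given fault -> list of per-run booleans."""
    outcomes = list(runs_by_fault.values())
    injections = sum(len(r) for r in outcomes)
    hits = sum(sum(r) for r in outcomes)
    return {
        # Faults flagged in at least one run (assumed meaning of coverage).
        "any_run": sum(1 for r in outcomes if any(r)),
        # Faults flagged in every run (assumed meaning of "reliably").
        "every_run": sum(1 for r in outcomes if all(r)),
        # Share of individual injections that were detected.
        "per_injection": hits / injections,
    }


# Toy data (hypothetical): three runs per fault for one monitoring stack.
example = {
    "pod-kill": [True, True, True],
    "cpu-stress": [True, False, True],
    "net-delay": [False, False, False],
}
summary = summarize_detection(example)
```

Under these assumptions a stack can score well on `any_run` while scoring lower on `per_injection`, which is the kind of trade-off the abstract reports between Beyla and Prometheus.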