Y. Wang | TU Delft Repository

Data quality improvement through data cleaning and augmentation methods

How do different tabular imputation techniques compare when addressing missing values in 6G datasets?

Bachelor thesis (2026) - H.K.K. Chan, R. Hai, Y. Wang, J. Urbano Merino

Sixth-generation (6G) wireless systems depend on data-hungry machine-learning pipelines, yet datasets collected from heterogeneous sources frequently contain missing values that bias models and degrade simulation reliability. Tabular imputation has been studied extensively, from statistical baselines (mean, kNN) through model-based methods (MICE, SoftImpute) to recent deep approaches (HyperImpute, GRAPE, DiffPuter), but no prior work systematically compares this range on 6G data under realistic missingness. We benchmark seven methods on DeepSense 6G datasets across four missingness mechanisms and three missingness rates, evaluating reconstruction accuracy, statistical fidelity, and downstream beam-prediction performance. Our benchmarks show that no single imputation method consistently dominates; performance depends on the missingness mechanism. Under cell-wise missingness, deep methods such as HyperImpute achieve the highest reconstruction fidelity, though downstream beam prediction remains robust to these localised corruptions. In contrast, row-wise missingness degrades all learned and deep approaches by breaking cross-feature dependencies; here, kNN is the only method that consistently preserves the downstream label signal. Overall, our results provide guidance for 6G pipeline defaults and highlight the limitations of applying purely tabular imputation to temporal wireless data. ...
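The contrast between cell-wise and row-wise missingness can be illustrated with a minimal sketch. It assumes purely random masking and a column-mean baseline; the function names, the toy data, and the 0.3 rate are hypothetical and are not taken from the thesis, whose four mechanisms and seven methods are not reproduced here.

```python
import random

def inject_cellwise(rows, rate, rng):
    """Mask each cell independently with probability `rate` (None marks missing)."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def inject_rowwise(rows, rate, rng):
    """Mask every feature of a row at once with probability `rate`."""
    return [[None] * len(row) if rng.random() < rate else list(row) for row in rows]

def mean_impute(rows):
    """Replace each missing cell with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fallback of 0.0 for a fully masked column is an arbitrary assumption.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]

def rmse_on_masked(truth, masked, imputed):
    """Root-mean-square error measured only on the cells that were masked."""
    errors = [(t - i) ** 2
              for t_row, m_row, i_row in zip(truth, masked, imputed)
              for t, m, i in zip(t_row, m_row, i_row) if m is None]
    return (sum(errors) / len(errors)) ** 0.5 if errors else 0.0

rng = random.Random(0)
# Toy data (assumption): the second feature is exactly twice the first.
truth = [[float(x), 2.0 * x] for x in range(50)]
cell_masked = inject_cellwise(truth, 0.3, rng)
row_masked = inject_rowwise(truth, 0.3, rng)
cell_rmse = rmse_on_masked(truth, cell_masked, mean_impute(cell_masked))
row_rmse = rmse_on_masked(truth, row_masked, mean_impute(row_masked))
```

In the row-wise case a masked row has no surviving feature to condition on, which is the situation the abstract describes as breaking cross-feature dependencies.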

Benchmarking Multivariate Time-Series Imputation in 6G Networks

A Comparative Study of Deep Learning and Classical Frameworks

Bachelor thesis (2026) - A. Neri, Y. Wang, R. Hai, J. Urbano Merino

Sixth-generation (6G) telecommunications rely on high-frequency millimeter-wave (mmWave) bands for massive data rates, but the physical fragility of these bands makes them highly susceptible to line-of-sight blockages. These blockages cause contiguous telemetry outages, creating a single point of failure for edge routing and orchestration protocols that demand continuous system data. To address this, we introduce an evaluation pipeline benchmarking five time-series imputation architectures, from statistical baselines (Nearest Neighbor, Kalman Filter) to complex deep learning models (BRITS, CSDI, TimesNet). Using an open-source microservice dataset, the pipeline dynamically injects simulated blockages across a 24-scenario grid, escalating from minor drops to 60-second outages. Performance is evaluated across an accuracy-latency Pareto frontier. Results demonstrate that the recurrent architecture, BRITS, achieves the highest overall reconstruction fidelity. However, Nearest Neighbor emerges as the optimal low-latency baseline, maintaining competitive accuracy while consistently executing in under 250 milliseconds. Finally, contextualizing these findings reveals a critical limitation: the architectures achieving peak accuracy inherently rely on offline, bidirectional processing to reconcile telemetry gaps. This highlights a significant research opportunity: deep learning models need to be evaluated in strictly online, forward-only forecasting configurations to meet the split-second streaming realities of live 6G edge deployment. ...
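As an illustration of the blockage-injection idea and the Nearest Neighbor baseline, the following is a minimal sketch rather than the thesis pipeline. The function names, the toy series, and the tie-breaking rule are assumptions.

```python
def inject_blockage(series, start, length):
    """Return a copy of `series` with `length` consecutive samples from `start` set to None."""
    out = list(series)
    for i in range(start, min(start + length, len(out))):
        out[i] = None
    return out

def nearest_neighbor_fill(series):
    """Fill each missing sample with the closest observed sample in time.

    Ties are broken in favour of the earlier sample (an arbitrary assumption).
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed samples")
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            nearest = min(observed, key=lambda j: (abs(j - i), j))
            filled[i] = series[nearest]
    return filled

# Toy telemetry (assumption): one sample per second.
telemetry = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
gapped = inject_blockage(telemetry, start=2, length=4)  # indices 2..5 missing
restored = nearest_neighbor_fill(gapped)
# restored == [10.0, 11.0, 11.0, 11.0, 16.0, 16.0, 16.0, 17.0]
```

Note that this fill looks at samples on both sides of the gap, so even this simple baseline is offline in the sense the abstract discusses; a forward-only variant would be restricted to the last observed value before the outage.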

Outlier and Anomaly-Handling for 6G Wireless Measurement Data

A Systematic, Downstream-Centric Comparison of Statistical Filters and Unsupervised Outlier Detectors for Tabular and Time-Series 6G Network Measurements

Bachelor thesis (2026) - M. Stanescu, R. Hai, Y. Wang, J. Urbano Merino

Machine-learning-driven management of next-generation (6G) networks depends on measurements that are routinely corrupted by sensor noise, hardware imperfections, bursty interference, and malicious activity, so outlier handling is widely assumed to be a prerequisite for reliable downstream models. Whether, and which, cleaning methods actually help, and whether this differs across data modalities, remains unclear. Using two real, labelled datasets with no synthetic contamination (attack traffic from a functional 5G testbed for attack classification, and an operational web-latency KPI series for short-term forecasting), we systematically compare six outlier-handling methods against a no-cleaning baseline: three interpretable statistical filters (robust Z-score replacement, IQR clipping, and Savitzky–Golay smoothing) and three unsupervised detectors (Isolation Forest, Local Outlier Factor, and PCA reconstruction). Hyperparameters are tuned without access to held-out labels; each method is evaluated under both a robust (Random Forest) and a noise-sensitive (k-NN) downstream model, with paired significance tests and false-discovery-rate (FDR) correction, a detection diagnostic, and runtime. The result is largely negative: after FDR correction, no method significantly improves downstream performance on either modality. Savitzky–Golay smoothing gives the only suggestive forecasting gain (≈17% lower error under Random Forest) but does not survive correction; deletion- and clipping-based methods are neutral-to-harmful (IQR significantly degrades classification); and the unsupervised detectors rank real attacks barely above chance (ROC-AUC 0.54–0.60), even though a supervised model separates the same classes at 0.86. Statistical outlier detection is therefore a poor proxy for the anomalies of interest.
As the slowest detectors are also the most harmful and exceed the near-real-time control budget, we conclude that a generic outlier-handling stage offers no reliable benefit for these tasks: its value must be demonstrated rather than assumed, with lightweight smoothing the only candidate worth trying on noisy sequential signals. ...
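Two of the statistical filters named above, robust Z-score replacement and IQR clipping, can be sketched as follows. The constant 0.6745, the threshold 3.5, and the multiplier 1.5 are conventional textbook defaults assumed here, not the tuned values from the thesis; the function names and the toy latency series are also hypothetical.

```python
import statistics

def robust_zscore_replace(values, threshold=3.5):
    """Replace points whose MAD-based robust Z-score exceeds `threshold` with the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: leave the data unchanged.
        return list(values)
    return [med if abs(0.6745 * (v - med) / mad) > threshold else v
            for v in values]

def iqr_clip(values, k=1.5):
    """Clip values to the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]

# Toy latency series in milliseconds (assumption) with one spike.
latency = [20.0, 21.0, 19.0, 22.0, 20.0, 21.0, 250.0, 20.0]
replaced = robust_zscore_replace(latency)
clipped = iqr_clip(latency)
```

Both filters modify the spike while leaving the remaining samples untouched. Whether such an edit helps a downstream model is exactly what the abstract argues must be tested rather than assumed.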

Tabular and Time-Series Position Encodings in 6G Network Data

Investigating the Effects on Beam-Prediction Performance and Representation Quality

Bachelor thesis (2026) - P. Fernández Luengo, R. Hai, Y. Wang, J. Urbano Merino

Sixth-generation (6G) networks collect positioning data that must be transformed into a suitable representation before machine-learning models can use it effectively. The choice of this encoding is rarely treated as an experimental variable, yet it strongly shapes what information reaches the downstream model. This paper evaluates how tabular and time-series encoding techniques affect beam prediction performance and feature representation quality in nine scenarios from the DeepSense 6G dataset. Beam-prediction performance is measured using two downstream classifiers in a fixed multi-seed evaluation pipeline, while representation quality is assessed through invariance under positional noise. Encodings that represent the user equipment relative to the base station and include temporal context achieve the best performance. However, the representation analysis reveals that these geometry-aware encodings are less stable under positional noise. The findings suggest that, when position estimates are accurate, position and trajectory data should be encoded using base-station-relative distance, bearing and recent geometric change, whereas noisier settings may require additional preprocessing to preserve robustness. ...
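The base-station-relative encoding described above (distance, bearing, and recent geometric change) could look roughly like the sketch below. It assumes positions are given as latitude and longitude in degrees and uses the standard haversine and initial-bearing formulas; the function names, the spherical Earth radius, and the one-step difference used for "recent change" are illustrative assumptions, not the thesis's feature definitions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def relative_encoding(bs_lat, bs_lon, ue_lat, ue_lon):
    """Return (distance in metres, bearing in degrees clockwise from north) from base station to UE."""
    phi1, phi2 = math.radians(bs_lat), math.radians(ue_lat)
    dphi = phi2 - phi1
    dlmb = math.radians(ue_lon - bs_lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    distance = 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    bearing = math.degrees(math.atan2(y, x)) % 360.0
    return distance, bearing

def encode_track(bs_lat, bs_lon, track):
    """Encode a list of (lat, lon) samples as (distance, bearing, d_distance, d_bearing)."""
    features, prev = [], None
    for lat, lon in track:
        dist, bear = relative_encoding(bs_lat, bs_lon, lat, lon)
        if prev is None:
            d_dist, d_bear = 0.0, 0.0  # no history for the first sample
        else:
            d_dist = dist - prev[0]
            # Wrap the bearing change into [-180, 180) degrees.
            d_bear = (bear - prev[1] + 180.0) % 360.0 - 180.0
        features.append((dist, bear, d_dist, d_bear))
        prev = (dist, bear)
    return features

# Hypothetical coordinates: UE moving north, away from the base station.
bs = (33.4200, -111.9300)
track = [(33.4210, -111.9300), (33.4220, -111.9300)]
features = encode_track(bs[0], bs[1], track)
```

Because distance and bearing are computed directly from the raw coordinates, any error in the position estimate propagates into every feature, which is consistent with the reduced stability under positional noise reported in the abstract.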

Evaluating Tabular and Time-Series Data Augmentation for 6G-Relevant Network-Performance Regression

Bachelor thesis (2026) - Q.T. den Haan, R. Hai, Y. Wang, J. Urbano Merino

Data-driven methods are expected to play an important role in future sixth-generation (6G) wireless systems, where network data can support performance prediction, simulation, and network optimization. However, collecting large and representative network-performance datasets can be difficult, which motivates the use of data augmentation. This study evaluates how different tabular and time-series augmentation techniques compare when addressing data scarcity in datasets relevant to future 6G systems. Two regression tasks are studied: a tabular AMF performance task using XGBoost and a time-series Python web-server performance task using an LSTM. Four tabular augmentation methods are evaluated: Gaussian Noise, SMOGN, CTGAN, and TVAE. Four time-series augmentation methods are evaluated: Jittering, Time Warping, TS-Mixup, and Frequency-domain augmentation. The methods are compared using downstream regression performance, statistical realism metrics, and diagnostic analysis of augmented data and test-set residuals. The results show that augmentation does not consistently improve regression performance. In the tabular task, all augmentation methods reduced performance compared with the XGBoost baseline. In the time-series task, Frequency-domain augmentation was the only method that improved the LSTM baseline, substantially reducing RMSE and MAE, although the final test-set $R^2$ remained negative. The diagnostics suggest that useful augmentation depends not only on preserving marginal distributions or value ranges, but also on preserving task-relevant feature-target relationships and temporal structure. Overall, the findings show that augmentation effectiveness is method- and data-type dependent, and that predictive performance should be evaluated together with statistical fidelity diagnostics. ...
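Of the time-series augmentation methods listed, Jittering is the simplest to illustrate. The sketch below assumes it means adding zero-mean Gaussian noise to each input window while keeping the regression target unchanged; the function names, the noise level, and the toy windows are hypothetical and do not reflect the thesis configuration.

```python
import random

def jitter(series, sigma, rng):
    """Return a copy of `series` with zero-mean Gaussian noise of standard deviation `sigma` added."""
    return [v + rng.gauss(0.0, sigma) for v in series]

def augment_with_jitter(windows, targets, copies, sigma, seed=0):
    """Append `copies` jittered variants of every window; each variant keeps its original target."""
    rng = random.Random(seed)
    aug_x, aug_y = list(windows), list(targets)
    for window, target in zip(windows, targets):
        for _ in range(copies):
            aug_x.append(jitter(window, sigma, rng))
            aug_y.append(target)
    return aug_x, aug_y

# Toy training windows and targets (assumption).
windows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
targets = [0.4, 0.7]
aug_x, aug_y = augment_with_jitter(windows, targets, copies=2, sigma=0.01)
```

Augmentation of this kind would be applied to the training split only, so that test-set metrics such as RMSE, MAE, and $R^2$ are still computed on unmodified data.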

Can Context-Aware Incremental Nets Outperform GBDTs Over Time?

A Tabular Lifelong-Learning Study

Master thesis (2025) - F.M. Gunnarsson, M.S. Pera, R. Hai, Y. Wang, F. Fang, J. Roeder, S. van Haren

Modern machine learning systems face unprecedented challenges in processing continuously arriving data streams while maintaining both computational efficiency and privacy compliance. Traditional batch learning approaches exhibit quadratic scaling in memory and computational requirements, making them unsuitable for long-term deployment in resource-constrained environments. Continual learning has advanced significantly for computer vision and natural language processing, yet tabular data, which represents the majority of industrial machine learning applications, has received comparatively little attention.

This thesis introduces IMLP (Incremental MLP), an attention-based architecture for energy-efficient continual learning on tabular data streams. IMLP augments a standard multilayer perceptron with attention-based feature rehearsal, maintaining a fixed-size buffer of learned 256-dimensional representations rather than raw historical samples. This design achieves constant computational complexity regardless of stream length while preserving task-relevant knowledge without storing personally identifiable information.

We conduct a comprehensive evaluation across 36 diverse TabZilla classification tasks against 14 baseline methods spanning gradient boosting, classical machine learning, and neural architectures. Using calibrated power measurement equipment and rigorous statistical analysis via Friedman omnibus tests with post-hoc comparisons, we establish that IMLP achieves a $4.2\times$ median speedup and a 79.6% energy reduction compared to standard MLPs while maintaining competitive accuracy (80.6% vs 82.9% balanced accuracy).

Our key findings demonstrate that IMLP successfully trades a modest 2.3-percentage-point accuracy reduction for substantial efficiency gains, achieving 97.5% of cumulative learning performance using only current segment data. The approach proves robust across datasets spanning 5 to 2,000 features and diverse domains including medical diagnosis, sensor data, and financial applications. Moreover, we introduce NetScore-T, a composite metric for evaluating accuracy-efficiency trade-offs, which positions IMLP optimally on the neural network Pareto frontier.

Therefore, this work establishes the feasibility of practical continual learning for resource-constrained environments while contributing the first systematic study of energy consumption in neural continual learning for tabular data, enabling deployment scenarios previously considered computationally infeasible.
...
