Data quality improvement through data cleaning and augmentation methods

How do different tabular imputation techniques compare when addressing missing values in 6G datasets?

Bachelor Thesis (2026)

Author(s)

H.K.K. Chan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

R. Hai – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Y. Wang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J. Urbano Merino – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords: Missing data, Imputation, 6G

To reference this document use

https://resolver.tudelft.nl/uuid:f685964c-9d0e-45c1-8d89-9a5a44dea625

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Sixth-generation (6G) wireless systems depend on data-hungry machine-learning pipelines, yet datasets collected from heterogeneous sources frequently contain missing values that bias models and degrade simulation reliability. Tabular imputation has been studied extensively, from statistical baselines (mean, kNN) through model-based methods (MICE, SoftImpute) to recent deep approaches (HyperImpute, GRAPE, DiffPuter), but no prior work systematically compares this range on 6G data under realistic missingness. We benchmark these seven methods on DeepSense 6G datasets across four missingness mechanisms and three missingness rates, evaluating reconstruction accuracy, statistical fidelity, and downstream beam-prediction performance. The results show that no single imputation method consistently dominates; performance depends on the missingness mechanism. Under cell-wise missingness, deep methods such as HyperImpute achieve the highest reconstruction fidelity, though downstream beam prediction remains robust to these localised corruptions. In contrast, row-wise missingness degrades all learned and deep approaches by breaking cross-feature dependencies; in this setting, kNN is the only method that consistently preserves the downstream label signal. Overall, the results provide guidance for 6G pipeline defaults and highlight the limitations of applying purely tabular imputation to temporal wireless data.
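To make the reconstruction-accuracy part of such a benchmark concrete, the following is a minimal sketch, not the thesis's actual pipeline. It assumes a small synthetic table in place of the DeepSense 6G data, treats cell-wise missingness as independent random masking of single cells, and compares only the two statistical baselines (mean and a simplified kNN). All function names are hypothetical and chosen for illustration.

```python
import math
import random


def make_correlated_table(n_rows, rng):
    """Synthetic stand-in data (assumption): three columns driven by one latent value."""
    table = []
    for _ in range(n_rows):
        z = rng.gauss(0.0, 1.0)
        table.append([z + rng.gauss(0.0, 0.1),
                      2.0 * z + rng.gauss(0.0, 0.1),
                      -z + rng.gauss(0.0, 0.1)])
    return table


def mask_cellwise(table, rate, rng):
    """Hide each cell independently with probability `rate`; keep >= 1 value per row."""
    masked = []
    for row in table:
        new_row = [None if rng.random() < rate else v for v in row]
        if all(v is None for v in new_row):
            keep = rng.randrange(len(row))
            new_row[keep] = row[keep]
        masked.append(new_row)
    return masked


def column_means(masked):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def impute_mean(masked):
    """Fill every hidden cell with its column mean."""
    means = column_means(masked)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def impute_knn(masked, k=5):
    """Fill a hidden cell with the average of that column over the k nearest rows.

    Distance is the mean squared difference over columns observed in both rows.
    """
    means = column_means(masked)
    imputed = []
    for i, row in enumerate(masked):
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not None:
                continue
            candidates = []
            for other_i, other in enumerate(masked):
                if other_i == i or other[j] is None:
                    continue
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    candidates.append((sum(shared) / len(shared), other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [value for _, value in candidates[:k]]
            # Fall back to the column mean if no comparable row exists.
            new_row[j] = sum(nearest) / len(nearest) if nearest else means[j]
        imputed.append(new_row)
    return imputed


def rmse_on_hidden(truth, masked, imputed):
    """Root-mean-square error computed only over the cells that were hidden."""
    errors = [(t - p) ** 2
              for t_row, m_row, p_row in zip(truth, masked, imputed)
              for t, m, p in zip(t_row, m_row, p_row) if m is None]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(0)
truth = make_correlated_table(200, rng)
masked = mask_cellwise(truth, rate=0.2, rng=rng)
rmse_mean = rmse_on_hidden(truth, masked, impute_mean(masked))
rmse_knn = rmse_on_hidden(truth, masked, impute_knn(masked, k=5))
print(f"mean RMSE: {rmse_mean:.3f}  kNN RMSE: {rmse_knn:.3f}")
```

Because the synthetic columns are strongly correlated, the kNN baseline can exploit the cells that remain observed in a row, whereas mean imputation ignores them. A fuller comparison along the lines described in the abstract would repeat this loop per missingness mechanism and rate, and would also score statistical fidelity and a downstream task rather than reconstruction error alone.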

Files

TabularImputation6G.pdf

(pdf | 0.641 Mb)

License info not available