Federated learning (FL) enables privacy-preserving collaboration among numerous clients for training machine learning models. In FL, a server coordinates model aggregation while preserving data privacy. However, non-independent and identically distributed (non-IID) label distributions across clients' local data degrade the performance of the global model. This paper investigates the impact of synthetic data on mitigating non-IID data distributions in federated learning. We explore data-based augmentation techniques, including uniform and minority imputation, using conditional variational autoencoders (CVAEs) to generate synthetic data.
Additionally, we examine a framework-based approach in which a model pre-trained centrally on synthetic data is distributed to clients for fine-tuning on their original datasets. Our experiments on the binarized MNIST dataset reveal a quality gap between synthetic and original data: training on synthetic data alone diminishes classification performance, while integrating both original and synthetic data improves performance under heavily imbalanced label distributions. However, uniform imputation experiments show that the amount of imputation must be balanced, as performance degrades noticeably once more than 45\% of a dataset consists of synthetic images. Minority imputation did not suffer from degradation in the explored range of imputation amounts and achieved an average F1 score improvement of 0.015 over uniform imputation.