Effect of Different Data Augmentation Strategies on Performance in Federated Learning Systems

Bachelor Thesis (2024)
Author(s)

L.S. Yadala Chanchu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S.J.F. Garst – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

David M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Alexios Voulimeneas – Graduation committee member (TU Delft - Cyber Security)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
26-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Federated learning (FL) enables privacy-preserving collaboration among many clients when training machine learning models. In FL, a central server coordinates model aggregation while the training data remain private on the clients. However, non-independent and identically distributed (non-IID) local label distributions degrade the performance of the global model. This paper investigates the impact of synthetic data on mitigating non-IID data distributions in federated learning. We explore data-based augmentation techniques, including uniform and minority imputation, using conditional variational autoencoders (CVAEs) to generate synthetic data.
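As a rough illustration of the two data-based strategies, the Python sketch below computes how many synthetic samples a client would request per class from a trained CVAE. The function name and the precise reading of "uniform" and "minority" imputation are assumptions for illustration, not taken from the thesis.

import numpy as np

def imputation_targets(label_counts, strategy="minority"):
    # How many synthetic samples to request per class for one client,
    # under an assumed interpretation of the two strategies:
    #   "uniform":  top every class up to the largest class count
    #   "minority": top up only classes below the mean class count
    label_counts = np.asarray(label_counts)
    if strategy == "uniform":
        return label_counts.max() - label_counts
    if strategy == "minority":
        return np.maximum(int(label_counts.mean()) - label_counts, 0)
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a heavily skewed client label distribution (10 MNIST classes).
counts = [500, 480, 20, 15, 10, 5, 450, 30, 25, 10]
print(imputation_targets(counts, "uniform"))
print(imputation_targets(counts, "minority"))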
Additionally, we examine a framework-based approach in which a model pre-trained centrally on synthetic data is distributed to clients for fine-tuning on their original datasets. Our results on the binarized MNIST dataset demonstrate a quality gap between synthetic and original data, leading to diminished classification performance when models are trained on synthetic data alone. Combining original and synthetic data improves performance on heavily imbalanced label distributions. At the same time, the uniform imputation experiments show that the amount of imputation must be balanced: performance degrades noticeably once more than 45% of a dataset consists of synthetic images. Synthetic imputation did not suffer from degradation in the explored range of imputation amounts and achieved an average F1-score improvement of 0.015 over uniform imputation.
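The framework-based approach can be pictured roughly as follows: a model is first trained centrally on CVAE-generated synthetic data, then each client fine-tunes a copy on its own original data. The PyTorch sketch below is a minimal illustration with placeholder function names and hyperparameters, not the thesis's actual training setup.

import copy
import torch
import torch.nn.functional as F

def client_finetune(pretrained_model, local_loader, epochs=1, lr=1e-3):
    # Fine-tune a copy of the centrally pre-trained model on one client's
    # original (non-synthetic) data; hyperparameters are placeholders.
    model = copy.deepcopy(pretrained_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in local_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return model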

Files

RP_Paper-7.pdf
(pdf | 0.825 Mb)