Effect of Different Data Augmentation Strategies on Performance in Federated Learning Systems

Bachelor Thesis (2024)
Author(s)

L.S. Yadala Chanchu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S.J.F. Garst – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

David M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Alexios Voulimeneas – Graduation committee member (TU Delft - Cyber Security)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
26-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Federated learning (FL) enables privacy-preserving collaboration among many clients when training machine learning models. In FL, a central server coordinates model aggregation while the training data remain private on the clients. However, non-independent and identically distributed (non-IID) local label distributions degrade the performance of the global model. This paper investigates the impact of synthetic data on mitigating non-IID data distributions in federated learning. We explore data-based augmentation techniques, including uniform and minority imputation, using conditional variational autoencoders (CVAEs) to generate synthetic data.
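As a rough illustration of the two data-based strategies, the Python sketch below computes how many synthetic samples a client would request per class from a trained CVAE. The function name and the precise reading of "uniform" and "minority" imputation are assumptions for illustration, not taken from the thesis.

import numpy as np

def imputation_targets(label_counts, strategy="minority"):
    # How many synthetic samples to request per class for one client,
    # under an assumed interpretation of the two strategies:
    #   "uniform":  top every class up to the largest class count
    #   "minority": top up only classes below the mean class count
    label_counts = np.asarray(label_counts)
    if strategy == "uniform":
        return label_counts.max() - label_counts
    if strategy == "minority":
        return np.maximum(int(label_counts.mean()) - label_counts, 0)
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a heavily skewed client label distribution (10 MNIST classes).
counts = [500, 480, 20, 15, 10, 5, 450, 30, 25, 10]
print(imputation_targets(counts, "uniform"))
print(imputation_targets(counts, "minority"))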
Additionally, we examine a framework-based approach in which a model pre-trained centrally on synthetic data is distributed to clients for fine-tuning on their original datasets. Our results on the binarized MNIST dataset demonstrate a quality gap between synthetic and original data, leading to diminished classification performance when models are trained on synthetic data alone. Combining original and synthetic data improves performance on heavily imbalanced label distributions. At the same time, the uniform imputation experiments show that the amount of imputation must be balanced: performance degrades noticeably once more than 45% of a dataset consists of synthetic images. Synthetic imputation did not suffer from degradation in the explored range of imputation amounts and achieved an average F1-score improvement of 0.015 over uniform imputation.
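The framework-based approach can be pictured roughly as follows: a model is first trained centrally on CVAE-generated synthetic data, then each client fine-tunes a copy on its own original data. The PyTorch sketch below is a minimal illustration with placeholder function names and hyperparameters, not the thesis's actual training setup.

import copy
import torch
import torch.nn.functional as F

def client_finetune(pretrained_model, local_loader, epochs=1, lr=1e-3):
    # Fine-tune a copy of the centrally pre-trained model on one client's
    # original (non-synthetic) data; hyperparameters are placeholders.
    model = copy.deepcopy(pretrained_model)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in local_loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            opt.step()
    return model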

Files

RP_Paper-7.pdf
(pdf | 0.825 Mb)