Federated learning from non-iid data

Improving accuracy through data-augmentation and communication efficiency

Abstract

Federated learning allows multiple parties to collaboratively develop a deep learning model without sharing private data. Models can be trained on the most up-to-date data while taking unique, not publicly available data into account. However, the distributed nature of federated learning also causes problems: clients are not guaranteed to hold independent and identically distributed (iid) data, which degrades performance.

This work analyzes existing methods of generating such skewed datasets and finds that the Earth Mover's Distance (EMD) can be used to compare them. A novel scheme called phase-shift is introduced, which allows clients to communicate more frequently without increasing the communication load, thereby reducing the drift caused by non-iid data. Finally, we propose a data-driven approach that reduces data skew by supplementing local datasets with augmented data. A novel method of balancing unaltered and augmented data is introduced that takes the skew of the dataset into account.
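As an illustration of how the EMD can quantify label skew, the sketch below compares a client's label distribution with the global one. It assumes the form common in the non-iid federated learning literature, where the EMD between label distributions reduces to the sum of absolute differences between class probabilities; the exact definition used in this work may differ. The function names `label_distribution` and `label_emd` are hypothetical and chosen only for this example.

```python
from collections import Counter


def label_distribution(labels, num_classes):
    """Return the empirical class-probability vector of a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def label_emd(client_labels, global_labels, num_classes):
    """Assumed EMD form: sum over classes of |p_client(y) - p_global(y)|.

    0.0 means the client matches the global label distribution;
    larger values indicate stronger label skew (the maximum is 2.0).
    """
    p = label_distribution(client_labels, num_classes)
    q = label_distribution(global_labels, num_classes)
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# Global dataset: 10 classes, 100 samples each (balanced, as in CIFAR-10).
global_labels = [c for c in range(10) for _ in range(100)]

# An iid-like client holds every class equally; a skewed client holds one class.
iid_client = [c for c in range(10) for _ in range(5)]
skewed_client = [3] * 50

iid_score = label_emd(iid_client, global_labels, 10)
skewed_score = label_emd(skewed_client, global_labels, 10)
```

Under this assumed definition, a score of this kind could serve as the skew measure that decides how much augmented data a client adds, although the balancing rule itself is not specified here.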

Empirical analysis shows that phase-shift can reduce the instantaneous communication load on the system by 37.5% without a loss in performance or a slower convergence rate. Evaluation of data augmentation on a heavily skewed CIFAR-10 dataset shows that accuracy is improved by 10%. Finally, phase-shift and data augmentation are combined, resulting in a 13% accuracy improvement and surpassing algorithms such as FedNova and FedProx when dealing with label heterogeneity.