Cross-Facility Federated Learning
Iacopo Colonnelli (University of Turin)
Robert Birke (University of Turin)
Giulio Malenza (University of Turin)
Gianluca Mittone (University of Turin)
Alberto Mulone (University of Turin)
J.M. Galjaard (TU Delft - Data-Intensive Systems)
Lydia Y. Chen (TU Delft - Data-Intensive Systems)
Sanzio Bassini (CINECA)
Gabriella Scipione (CINECA)
and more authors (External organisation)
Abstract
In a decade, AI frontier research has transitioned from the researcher's workstation to thousands of high-end hardware-accelerated compute nodes. This rapid evolution shows no signs of slowing down in the foreseeable future. While top cloud providers may be able to keep pace with this growth rate, obtaining and efficiently exploiting computing resources at that scale is a daunting challenge for universities and SMEs. This work introduces the Cross-Facility Federated Learning (XFFL) framework to bridge this compute divide, giving data scientists and domain experts the opportunity to efficiently exploit multiple independent data centres for extreme-scale deep learning tasks. XFFL relies on hybrid workflow abstractions to decouple tasks from environment-specific technicalities, reducing complexity and enhancing reusability. In addition, Federated Learning (FL) algorithms eliminate the need to move large amounts of data between different facilities, reducing time-to-solution and preserving data privacy. The XFFL approach is empirically evaluated by training a full LLaMAv2 7B instance on two facilities of the EuroHPC JU, showing how the increased computing power completely compensates for the additional overhead introduced by distributing the training across two data centres.
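To illustrate the FL aggregation idea the abstract refers to, the minimal PyTorch sketch below averages model weights received from independent facilities, so only parameter tensors (never the raw training data) cross facility boundaries. The function name, signature, and uniform weighting are illustrative assumptions, not the actual XFFL API.

import torch

def federated_average(state_dicts, weights=None):
    # Hypothetical helper, not the XFFL API: combine model replicas
    # trained independently at each facility by averaging their
    # parameters (FedAvg-style). Only these weight tensors are
    # exchanged between data centres; the training data never moves.
    if weights is None:
        # Default assumption: each facility contributes equally.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        averaged[name] = sum(
            w * sd[name].float() for w, sd in zip(weights, state_dicts)
        )
    return averaged

In an XFFL-style round, each facility would train its local replica, ship only its state dict to an aggregation step, and receive the averaged weights back before the next round, which is what keeps inter-facility traffic limited to model parameters.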