Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models: Decentralised Training of Large Language Models

Blagoev, Nikolay

Title

Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models: Decentralised Training of Large Language Models

Author

Blagoev, Nikolay (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

Chen, Lydia Y. (mentor)
Decouchant, Jérémie (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science

Date

2024-06-24

Abstract

Motivated by the emergence of Large Language Models (LLMs) and the importance of democratizing their training, we propose Go With The Flow, the first practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, Go With The Flow enables the collaborative training of an LLM on a set of heterogeneous client nodes that dedicate different resources for an undefined amount of time. Our work addresses node churn, i.e., clients joining or leaving the system, and network instabilities, i.e., network links becoming unstable or unreliable. The core of Go With The Flow is a decentralized flow algorithm that finds the most effective routing to train a maximum number of microbatches with a minimum delay. We extensively evaluate our work on LLaMA-like and GPT-like models, compare it against prior art, and achieve up to a 45% reduction in training time in realistic and challenging scenarios with heterogeneous client nodes distributed across 10 geographic locations and a high node churn rate. We further demonstrate resilient training in such challenging environments without sacrificing convergence.

Subject

Fault tolerant
Distributed
Machine learning
LLM

To reference this document use:

http://resolver.tudelft.nl/uuid:c7d337b2-440e-4d74-8e39-6e6ffcce5f3b

Embargo date

2024-08-31

Part of collection

Student theses

Document type

master thesis

Files

file embargo until 2024-08-31