Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models
N. Blagoev (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Y. Chen – Mentor (TU Delft - Data-Intensive Systems)
Jeremie Decouchant – Graduation committee member (TU Delft - Data-Intensive Systems)
Abstract
Motivated by the emergence of Large Language Models (LLMs) and the importance of democratizing their training, we propose Go With The Flow, the first practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, Go With The Flow enables the collaborative training of an LLM on a set of heterogeneous client nodes that dedicate different resources for an indefinite amount of time. Our work addresses node churn, i.e., clients joining or leaving the system, and network instabilities, i.e., network links becoming unstable or unreliable. The core of Go With The Flow is a decentralized flow algorithm that finds the most effective routing to train the maximum number of microbatches with minimum delay. We extensively evaluate our work on Llama-like and GPT-like models, compare it against prior art, and achieve up to 45% training time reduction in realistic and challenging scenarios with heterogeneous client nodes distributed across 10 geographic locations and a high node churn rate. We further demonstrate resilient training in such challenging environments without sacrificing convergence.
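To make the flow-based routing idea concrete, below is a minimal illustrative sketch, not the paper's implementation, that casts microbatch routing across pipeline stages as a min-cost max-flow problem: edge capacities stand in for per-node throughput and edge weights for link delay. The node names, capacities, and latencies are hypothetical example values, and the paper's actual algorithm is decentralized rather than solved by a single centralized solver as done here.

```python
# Illustrative sketch: route as many microbatches as possible through a
# 2-stage pipeline of heterogeneous workers while minimizing total link delay.
# All node names, capacities, and delays below are made-up example values.
import networkx as nx

G = nx.DiGraph()

# Source -> stage-1 workers: capacity = microbatches the worker can process per round.
G.add_edge("source", "s1_fast", capacity=8, weight=0)
G.add_edge("source", "s1_slow", capacity=3, weight=0)

# Stage-1 -> stage-2 links: capacity = link throughput, weight = link delay (ms).
G.add_edge("s1_fast", "s2_a", capacity=6, weight=20)
G.add_edge("s1_fast", "s2_b", capacity=6, weight=55)
G.add_edge("s1_slow", "s2_a", capacity=3, weight=70)
G.add_edge("s1_slow", "s2_b", capacity=3, weight=25)

# Stage-2 workers -> sink: capacity = microbatches the worker can absorb per round.
G.add_edge("s2_a", "sink", capacity=7, weight=0)
G.add_edge("s2_b", "sink", capacity=5, weight=0)

# Maximize the number of routed microbatches, then minimize total delay
# among all maximum routings.
flow = nx.max_flow_min_cost(G, "source", "sink")
for u, targets in flow.items():
    for v, f in targets.items():
        if f > 0:
            print(f"{u} -> {v}: {f} microbatches")
```

In this toy setting the solver saturates the fast stage-1 worker and steers its traffic toward the low-delay links first, which is the intuition behind routing microbatches around slow or unreliable paths.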