Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models
N. Blagoev (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Y. Chen – Mentor (TU Delft - Data-Intensive Systems)
Jeremie Decouchant – Graduation committee member (TU Delft - Data-Intensive Systems)
Abstract
Motivated by the emergence of Large Language Models (LLMs) and the importance of democratizing their training, we propose Go With The Flow, the first practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, Go With The Flow enables the collaborative training of an LLM on a set of heterogeneous client nodes that dedicate different resources for an indefinite amount of time. Our work addresses node churn, i.e., clients joining or leaving the system, and network instabilities, i.e., network links becoming unstable or unreliable. The core of Go With The Flow is a decentralized flow algorithm that finds the most effective routing to train the maximum number of microbatches with minimum delay. We extensively evaluate our work on Llama-like and GPT-like models, compare it against prior art, and achieve up to 45% training time reduction in realistic and challenging scenarios with heterogeneous client nodes distributed across 10 geographic locations and a high node churn rate. We further demonstrate resilient training in such challenging environments without sacrificing convergence.
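To make the flow-based routing idea concrete, below is a minimal illustrative sketch, not the paper's implementation, that casts microbatch routing across pipeline stages as a min-cost max-flow problem: edge capacities stand in for per-node throughput and edge weights for link delay. The node names, capacities, and latencies are hypothetical example values, and the paper's actual algorithm is decentralized rather than solved by a single centralized solver as done here.

```python
# Illustrative sketch: route as many microbatches as possible through a
# 2-stage pipeline of heterogeneous workers while minimizing total link delay.
# All node names, capacities, and delays below are made-up example values.
import networkx as nx

G = nx.DiGraph()

# Source -> stage-1 workers: capacity = microbatches the worker can process per round.
G.add_edge("source", "s1_fast", capacity=8, weight=0)
G.add_edge("source", "s1_slow", capacity=3, weight=0)

# Stage-1 -> stage-2 links: capacity = link throughput, weight = link delay (ms).
G.add_edge("s1_fast", "s2_a", capacity=6, weight=20)
G.add_edge("s1_fast", "s2_b", capacity=6, weight=55)
G.add_edge("s1_slow", "s2_a", capacity=3, weight=70)
G.add_edge("s1_slow", "s2_b", capacity=3, weight=25)

# Stage-2 workers -> sink: capacity = microbatches the worker can absorb per round.
G.add_edge("s2_a", "sink", capacity=7, weight=0)
G.add_edge("s2_b", "sink", capacity=5, weight=0)

# Maximize the number of routed microbatches, then minimize total delay
# among all maximum routings.
flow = nx.max_flow_min_cost(G, "source", "sink")
for u, targets in flow.items():
    for v, f in targets.items():
        if f > 0:
            print(f"{u} -> {v}: {f} microbatches")
```

In this toy setting the solver saturates the fast stage-1 worker and steers its traffic toward the low-delay links first, which is the intuition behind routing microbatches around slow or unreliable paths.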