Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models

Title: Go With The Flow: Fault-Tolerant Decentralized Training of Large Language Models: Decentralised Training of Large Language Models
Author: Blagoev, Nikolay (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Chen, Lydia Y. (mentor); Decouchant, Jérémie (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2024-06-24

Abstract: Motivated by the emergence of Large Language Models (LLMs) and the importance of democratizing their training, we propose Go With The Flow, the first practical decentralized training framework for LLMs. Unlike existing distributed and federated training frameworks, Go With The Flow enables the collaborative training of an LLM on a set of heterogeneous client nodes that contribute differing resources for an unspecified amount of time. Our work addresses node churn, i.e., clients joining or leaving the system, and network instabilities, i.e., network links becoming unstable or unreliable. The core of Go With The Flow is a decentralized flow algorithm that finds the most effective routing to train a maximum number of microbatches with a minimum delay. We extensively evaluate our work on LLaMA-like and GPT-like models and compare it against the prior art, achieving up to a 45% reduction in training time in realistic and challenging scenarios with heterogeneous client nodes distributed across 10 geographic locations and a high node churn rate. We further demonstrate resilient training in such challenging environments without sacrificing convergence.
Subject: Fault tolerant; Distributed; Machine learning; LLM
To reference this document use: http://resolver.tudelft.nl/uuid:c7d337b2-440e-4d74-8e39-6e6ffcce5f3b
Embargo date: 2024-08-31
Part of collection: Student theses
Document type: master thesis
Rights: © 2024 Nikolay Blagoev
Files: file embargo until 2024-08-31