KubeML: An Efficient Serverless Platform for Scalable Deep Learning
D. Albo Martinez (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jan S. Rellermeyer – Mentor (TU Delft - Data-Intensive Systems)
Dick H.J. Epema – Graduation committee member (TU Delft - Data-Intensive Systems)
A. Katsifodimos – Graduation committee member (TU Delft - Web Information Systems)
Abstract
Serverless computing is an emerging paradigm for structuring applications in such a way that they can benefit from on-demand computing resources and achieve horizontal scalability. As such, it is an ideal substrate for the resource-intensive and often ad-hoc task of training deep learning models. However, the design and stateless nature of serverless platforms make it difficult to translate distributed machine learning systems directly to this new world. With KubeML, we present a purpose-built serverless machine learning system that runs atop Kubernetes and seamlessly embeds into the popular PyTorch framework. Unlike alternative systems, KubeML fully embraces GPU acceleration and is able to outperform TensorFlow, especially with smaller local batches, while allowing for higher resource density. KubeML achieves a 3.98x faster time-to-accuracy than TensorFlow with small batch sizes, and a 2.02x speedup when comparing the best results of both platforms on commonly benchmarked machine learning models such as ResNet34.
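To make the distributed training pattern behind such a system concrete, the sketch below illustrates, in plain PyTorch, the data-parallel model-averaging idea that serverless training builds on: each short-lived, stateless worker starts from the current global weights, takes a local step on its shard of the data, and the resulting states are merged by averaging. This is a minimal, hypothetical illustration only; the names `SmallNet`, `worker_step`, and `average_states` are assumptions for this sketch and are not part of the KubeML API.

```python
# Illustrative sketch of data-parallel training with stateless workers and
# model averaging. All names are hypothetical; this is NOT the KubeML API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallNet(nn.Module):
    """Toy classifier standing in for a larger model such as ResNet34."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


def worker_step(global_model, x_shard, y_shard, lr=0.1):
    """One stateless 'function invocation': copy the global weights,
    take a local SGD step on a data shard, and return the new state."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    opt.zero_grad()
    loss = F.cross_entropy(local(x_shard), y_shard)
    loss.backward()
    opt.step()
    return local.state_dict()


def average_states(states):
    """Merge worker results by averaging each parameter tensor."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    return avg


if __name__ == "__main__":
    torch.manual_seed(0)
    model = SmallNet()
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))

    # Split one batch across four hypothetical serverless workers, run their
    # local steps, and fold the averaged update back into the global model.
    shards = zip(x.chunk(4), y.chunk(4))
    states = [worker_step(model, xs, ys) for xs, ys in shards]
    model.load_state_dict(average_states(states))
    print("merged one round of updates from", len(states), "workers")
```

In a serverless setting, each `worker_step` call would correspond to an independent function invocation that fetches the global weights from shared storage and writes its update back, with the averaging step playing the role of the parameter merge between training rounds.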