KubeML: An Efficient Serverless Platform for Scalable Deep Learning

Master Thesis (2021)
Author(s)

D. Albo Martinez (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jan S. Rellermeyer – Mentor (TU Delft - Data-Intensive Systems)

Dick H.J. Epema – Graduation committee member (TU Delft - Data-Intensive Systems)

A. Katsifodimos – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2021 Diego Albo Martinez
Publication Year
2021
Language
English
Graduation Date
17-06-2021
Awarding Institution
Delft University of Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Serverless computing is an emerging paradigm for structuring applications in such a way that they can benefit from on-demand computing resources and achieve horizontal scalability. As such, it is an ideal substrate for the resource-intensive and often ad-hoc task of training deep learning models. However, the design and stateless nature of serverless platforms make it difficult to translate distributed machine learning systems directly to this new world. With KubeML, we present a purpose-built serverless machine learning system that runs atop Kubernetes and seamlessly embeds into the popular PyTorch framework. Unlike alternative systems, KubeML fully embraces GPU acceleration and is able to outperform TensorFlow, especially with smaller local batches, while allowing for higher resource density. KubeML reaches a 3.98x faster time-to-accuracy with small batch sizes and maintains a 2.02x speedup when comparing the best results of the two platforms for commonly benchmarked machine learning models such as ResNet34.

Files

Diegoalbo_kubeml_report.pdf
(pdf | 4.23 MB)
License info not available