KubeML: An Efficient Serverless Platform for Scalable Deep Learning
D. Albo Martinez (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jan S. Rellermeyer – Mentor (TU Delft - Data-Intensive Systems)
Dick H.J. Epema – Graduation committee member (TU Delft - Data-Intensive Systems)
A. Katsifodimos – Graduation committee member (TU Delft - Web Information Systems)
Abstract
Serverless computing is an emerging paradigm for structuring applications in such a way that they can benefit from on-demand computing resources and achieve horizontal scalability. As such, it is an ideal substrate for the resource-intensive and often ad-hoc task of training deep learning models. However, the design and stateless nature of serverless platforms make it difficult to translate distributed machine learning systems directly to this new world. With KubeML, we present a purpose-built serverless machine learning system that runs atop Kubernetes and seamlessly embeds into the popular PyTorch framework. Unlike alternative systems, KubeML fully embraces GPU acceleration and is able to outperform TensorFlow, especially with smaller local batches, while allowing for higher resource density. KubeML achieves a 3.98x faster time-to-accuracy than TensorFlow with small batch sizes, and a 2.02x speedup when comparing the best results of both platforms on commonly benchmarked machine learning models such as ResNet34.
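To make the distributed training pattern behind such a system concrete, the sketch below illustrates, in plain PyTorch, the data-parallel model-averaging idea that serverless training builds on: each short-lived, stateless worker starts from the current global weights, takes a local step on its shard of the data, and the resulting states are merged by averaging. This is a minimal, hypothetical illustration only; the names `SmallNet`, `worker_step`, and `average_states` are assumptions for this sketch and are not part of the KubeML API.

```python
# Illustrative sketch of data-parallel training with stateless workers and
# model averaging. All names are hypothetical; this is NOT the KubeML API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallNet(nn.Module):
    """Toy classifier standing in for a larger model such as ResNet34."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))


def worker_step(global_model, x_shard, y_shard, lr=0.1):
    """One stateless 'function invocation': copy the global weights,
    take a local SGD step on a data shard, and return the new state."""
    local = copy.deepcopy(global_model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    opt.zero_grad()
    loss = F.cross_entropy(local(x_shard), y_shard)
    loss.backward()
    opt.step()
    return local.state_dict()


def average_states(states):
    """Merge worker results by averaging each parameter tensor."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in states]).mean(dim=0)
    return avg


if __name__ == "__main__":
    torch.manual_seed(0)
    model = SmallNet()
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))

    # Split one batch across four hypothetical serverless workers, run their
    # local steps, and fold the averaged update back into the global model.
    shards = zip(x.chunk(4), y.chunk(4))
    states = [worker_step(model, xs, ys) for xs, ys in shards]
    model.load_state_dict(average_states(states))
    print("merged one round of updates from", len(states), "workers")
```

In a serverless setting, each `worker_step` call would correspond to an independent function invocation that fetches the global weights from shared storage and writes its update back, with the averaging step playing the role of the parameter merge between training rounds.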