Large Scale In-Database Machine Learning Using Cloud Native Workflows


Abstract

During the last decade, the proliferation of smartphones, social media, and streaming services has provoked an explosion of multimedia data. This large amount of image and video sources, combined with the more powerful and inexpensive computational capabilities brought by the cloud computing paradigm, has facilitated the rapid growth of new machine learning models capable of extracting information faster and more accurately. However, the complexity of developing machine learning models has also grown, involving multiple steps, from the acquisition and preparation of data to the training, evaluation, and deployment of models. To alleviate this, the leading database providers have started to integrate the predictive capabilities of machine learning directly into their systems. This new approach is known as in-database machine learning, and it brings interesting new properties, such as the exploitation of the inherent relational structure of data and the preservation of its privacy and integrity, since inference occurs directly where the data lives. In this work, we present a cloud-native approach to in-database machine learning. We have extended SQLFlow, a bridge between SQL engines and machine learning toolkits, to support models trained for image recognition tasks over image datasets whose meta-information is persisted in a relational database. Furthermore, we have encapsulated the definition of machine learning models in cloud-native workflows that are able to exploit the GPU resources available in a Kubernetes environment. Our research evaluates the scalability of the proposed system in terms of total execution time and GPU utilization. In addition, we explore the design of optimized machine learning query plans, where the goal is to choose, among multiple models that each cover a range of specific classes, the optimal one to answer a prediction query according to its accuracy and execution cost. For that purpose, we have implemented a model repository containing different model variations and evaluated different strategies to optimize the model selection. Our experiments show that optimizing the model selection leads to more accurate and faster results, especially when a query covers a high number of classes and the number of models able to answer them is limited.
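To illustrate the kind of statement such a SQLFlow extension targets, a minimal sketch in SQLFlow's extended SQL syntax follows. The table, column, and model names (images.val, images.predictions, model_repository.resnet50) are hypothetical, chosen only to show a prediction over an image dataset whose meta-information resides in a relational table:

    -- Hypothetical schema: images.val(image_path, ...) holds the relational
    -- meta-information of an image dataset.
    -- TO PREDICT ... USING ... is SQLFlow's extended SQL syntax for inference.
    SELECT image_path
    FROM images.val
    TO PREDICT images.predictions.class  -- write predicted labels here
    USING model_repository.resnet50;     -- hypothetical trained image classifier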
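The model selection behind an optimized query plan can itself be phrased relationally. The sketch below assumes a hypothetical catalog table, model_repository.catalog(model_name, class_name, accuracy, cost_ms), with one row per (model, class) pair a model can predict and the model's inference latency repeated on each of its rows; it shows only one possible selection strategy, picking the single model that covers every requested class, preferring accuracy and breaking ties on cost:

    -- Hypothetical catalog: one row per (model, class) pair;
    -- cost_ms is the model's inference latency, repeated on each row.
    SELECT model_name
    FROM model_repository.catalog
    WHERE class_name IN ('cat', 'dog', 'horse')
    GROUP BY model_name
    -- Keep only models that cover all three requested classes.
    HAVING COUNT(DISTINCT class_name) = 3
    -- Prefer the most accurate candidate; break ties on execution cost.
    ORDER BY AVG(accuracy) DESC, MAX(cost_ms) ASC
    LIMIT 1;

When no single candidate covers the whole query, relaxing the HAVING filter and combining the per-class best models is a natural fallback; that is precisely the regime highlighted above, where a query covers many classes and few models can answer them.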