Large Scale In-Database Machine Learning Using Cloud Native Workflows


Abstract

During the last decade, the proliferation of smartphones, social media, and streaming services has provoked an explosion of multimedia data. This large amount of image and video sources, combined with the more powerful and inexpensive computational capabilities brought by the cloud computing paradigm, has facilitated the rapid growth of new machine learning models capable of extracting information faster and more accurately. However, the complexity of developing machine learning models has also grown, involving multiple steps, from the acquisition and preparation of data to the training, evaluation, and deployment of models. To alleviate this, the leading database providers have started to integrate the predictive capabilities of machine learning directly into their systems. This new approach is known as in-database machine learning, and it brings interesting new properties, such as the exploitation of the inherent relational structure of data and the preservation of its privacy and integrity, since inference occurs directly where the data lives. In this work, we present a cloud-native approach to in-database machine learning. We have extended SQLFlow, a bridge between SQL engines and machine learning toolkits, to support models trained for image recognition tasks over image datasets whose meta-information is persisted in a relational database. Furthermore, we have encapsulated the definition of machine learning models in cloud-native workflows that are able to exploit the GPU resources available in a Kubernetes environment. Our research evaluates the scalability of the proposed system in terms of total execution time and GPU utilization. In addition, we explore the design of optimized machine learning query plans, where the goal is to choose, among multiple models that each cover a range of specific classes, the optimal one to answer a prediction query according to its accuracy and execution cost. For that purpose, we have implemented a model repository containing different model variations and evaluated different strategies to optimize the model selection. Our experiments show that optimizing the model selection leads to more accurate and faster results, especially when a query covers a high number of classes and the number of models able to answer them is limited.
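To illustrate the kind of statement such a SQLFlow extension targets, a minimal sketch in SQLFlow's extended SQL syntax follows. The table, column, and model names (images.val, images.predictions, model_repository.resnet50) are hypothetical, chosen only to show a prediction over an image dataset whose meta-information resides in a relational table:

    -- Hypothetical schema: images.val(image_path, ...) holds the relational
    -- meta-information of an image dataset.
    -- TO PREDICT ... USING ... is SQLFlow's extended SQL syntax for inference.
    SELECT image_path
    FROM images.val
    TO PREDICT images.predictions.class  -- write predicted labels here
    USING model_repository.resnet50;     -- hypothetical trained image classifier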
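The model selection behind an optimized query plan can itself be phrased relationally. The sketch below assumes a hypothetical catalog table, model_repository.catalog(model_name, class_name, accuracy, cost_ms), with one row per (model, class) pair a model can predict and the model's inference latency repeated on each of its rows; it shows only one possible selection strategy, picking the single model that covers every requested class, preferring accuracy and breaking ties on cost:

    -- Hypothetical catalog: one row per (model, class) pair;
    -- cost_ms is the model's inference latency, repeated on each row.
    SELECT model_name
    FROM model_repository.catalog
    WHERE class_name IN ('cat', 'dog', 'horse')
    GROUP BY model_name
    -- Keep only models that cover all three requested classes.
    HAVING COUNT(DISTINCT class_name) = 3
    -- Prefer the most accurate candidate; break ties on execution cost.
    ORDER BY AVG(accuracy) DESC, MAX(cost_ms) ASC
    LIMIT 1;

When no single candidate covers the whole query, relaxing the HAVING filter and combining the per-class best models is a natural fallback; that is precisely the regime highlighted above, where a query covers many classes and few models can answer them.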