Cost Estimation for Factorized Machine Learning


Abstract

In machine learning (ML), the efficiency of the training process is paramount. The conventional first step in an ML workflow is to collect data from various sources and merge it into a single table, a process known as materialization. This introduces redundancy, because attribute values of a joined row are duplicated for every matching row on the other side of the join. Factorized ML avoids this redundancy by keeping the data in its original form and training the model directly on the separate source tables, which can substantially speed up training.
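To make the idea concrete, the following minimal sketch (illustrative only; the table names, shapes, and model are not from this work) shows the core trick behind factorized linear algebra: a linear model's matrix-vector product over a key-foreign-key join can be computed by pushing the computation into the base tables, without ever building the joined table.

```python
import numpy as np

# Toy key-foreign-key join: each row of S references one row of R.
rng = np.random.default_rng(0)
n_s, n_r, d_s, d_r = 10_000, 100, 4, 6   # R is small and heavily repeated
S = rng.normal(size=(n_s, d_s))
R = rng.normal(size=(n_r, d_r))
fk = rng.integers(0, n_r, size=n_s)       # foreign-key column of S

w = rng.normal(size=d_s + d_r)            # linear-model weights over all features

# Materialized: join first, then multiply -- R's rows are copied n_s times.
T = np.hstack([S, R[fk]])                 # shape (n_s, d_s + d_r)
out_mat = T @ w

# Factorized: push the multiply through the join. R now contributes only
# n_r * d_r multiplications instead of n_s * d_r.
out_fac = S @ w[:d_s] + (R @ w[d_s:])[fk]

assert np.allclose(out_mat, out_fac)
```

The saving grows with the redundancy of the join: the more often R's rows are repeated in the materialized table, the more work factorization avoids.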

However, factorized training does not always reduce cost compared to traditional materialized training. This research tackles that issue by examining the multidimensional cost optimization problem that arises when choosing between factorized and materialized learning. It fills gaps left by prior research, which focused on CPU-based training, by investigating the cost estimation landscape for factorized ML with special emphasis on GPU performance compared to CPUs. We extend an existing factorized ML framework to support GPU training, which previous work had not explored, and demonstrate that GPU training exhibits significantly different cost characteristics from CPU training, with substantial implications for the design of cost models and the optimization of factorized ML.

Through an empirical study, we develop an ML-based cost model that accurately predicts the faster training method across a wide range of scenarios. In an extensive evaluation on real-world datasets, this model achieves an average speedup of 3.8x, compared to 0.9x for the state of the art. We also show that it generalizes to datasets and hardware settings it was not trained on, retaining 82% of its training-set performance.
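As a rough illustration of how such a cost model can be used at planning time (the features, labels, and classifier below are hypothetical placeholders, not the ones from this study), one could train a classifier on per-scenario statistics and query it before choosing an execution strategy:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per join scenario: (tuple ratio, feature ratio,
# number of base-table rows, 1 if training on GPU else 0).
X = [
    [8.0, 3.0, 1e6, 0],
    [1.2, 1.1, 5e4, 0],
    [8.0, 3.0, 1e6, 1],
    [1.2, 1.1, 5e4, 1],
]
# Label: 1 if factorized training was measured to be faster, else 0.
y = [1, 0, 1, 0]

cost_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At planning time, predict which strategy to use for a new scenario.
use_factorized = cost_model.predict([[5.0, 2.0, 2e5, 1]])[0] == 1
print("train factorized" if use_factorized else "train materialized")
```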

Our cost model for factorized ML enables significant time savings in training-intensive scenarios and further underlines the benefits of factorized training. However, effort should be invested in integrating factorized training into existing ML frameworks, so that this training method, and our cost model, can be evaluated in a broader set of realistic scenarios.