Cost Estimation for Factorized Machine Learning


Abstract

In machine learning (ML), the efficiency of the training process is paramount. The conventional first step in an ML workflow is to collect data from various sources and merge it into a single table, a process known as materialization. This introduces redundancy, because attribute values of a joined row are duplicated for every matching row on the other side of the join. Factorized ML avoids this redundancy by keeping the data in its original form and training the model directly on the separate source tables, which can substantially speed up training.
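To make the idea concrete, the following minimal sketch (illustrative only; the table names, shapes, and model are not from this work) shows the core trick behind factorized linear algebra: a linear model's matrix-vector product over a key-foreign-key join can be computed by pushing the computation into the base tables, without ever building the joined table.

```python
import numpy as np

# Toy key-foreign-key join: each row of S references one row of R.
rng = np.random.default_rng(0)
n_s, n_r, d_s, d_r = 10_000, 100, 4, 6   # R is small and heavily repeated
S = rng.normal(size=(n_s, d_s))
R = rng.normal(size=(n_r, d_r))
fk = rng.integers(0, n_r, size=n_s)       # foreign-key column of S

w = rng.normal(size=d_s + d_r)            # linear-model weights over all features

# Materialized: join first, then multiply -- R's rows are copied n_s times.
T = np.hstack([S, R[fk]])                 # shape (n_s, d_s + d_r)
out_mat = T @ w

# Factorized: push the multiply through the join. R now contributes only
# n_r * d_r multiplications instead of n_s * d_r.
out_fac = S @ w[:d_s] + (R @ w[d_s:])[fk]

assert np.allclose(out_mat, out_fac)
```

The saving grows with the redundancy of the join: the more often R's rows are repeated in the materialized table, the more work factorization avoids.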

However, factorized training does not always reduce cost compared to traditional materialized training. This research tackles that issue by examining the multidimensional cost optimization problem that arises when choosing between factorized and materialized learning. It fills gaps left by prior research, which focused on CPU-based training, by investigating the cost estimation landscape for factorized ML with special emphasis on GPU performance compared to CPUs. We extend an existing factorized ML framework to support GPU training, which previous work had not explored, and demonstrate that GPU training exhibits significantly different cost characteristics from CPU training, with substantial implications for the design of cost models and the optimization of factorized ML.

Through an empirical study, we develop an ML-based cost model that accurately predicts the faster training method across a wide range of scenarios. In an extensive evaluation on real-world datasets, this model achieves an average speedup of 3.8x, compared to 0.9x for the state of the art. We also show that it generalizes to datasets and hardware settings it was not trained on, retaining 82% of its training-set performance.
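As a rough illustration of how such a cost model can be used at planning time (the features, labels, and classifier below are hypothetical placeholders, not the ones from this study), one could train a classifier on per-scenario statistics and query it before choosing an execution strategy:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per join scenario: (tuple ratio, feature ratio,
# number of base-table rows, 1 if training on GPU else 0).
X = [
    [8.0, 3.0, 1e6, 0],
    [1.2, 1.1, 5e4, 0],
    [8.0, 3.0, 1e6, 1],
    [1.2, 1.1, 5e4, 1],
]
# Label: 1 if factorized training was measured to be faster, else 0.
y = [1, 0, 1, 0]

cost_model = DecisionTreeClassifier(max_depth=3).fit(X, y)

# At planning time, predict which strategy to use for a new scenario.
use_factorized = cost_model.predict([[5.0, 2.0, 2e5, 1]])[0] == 1
print("train factorized" if use_factorized else "train materialized")
```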

Our cost model for factorized ML enables significant time savings in training-intensive scenarios and further underlines the benefits of factorized training. However, effort should be invested in integrating factorized training into existing ML frameworks, so that this training method, and our cost model, can be evaluated in a broader set of realistic scenarios.