An Intermediate Representation for Optimizing Machine Learning Pipelines

Journal Article (2019)

Author(s)

Andreas Kunft (Technical University of Berlin)

Asterios Katsifodimos (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Sebastian Schelter (New York University)

Sebastian Bress (German Research Centre for Artificial Intelligence (DFKI))

Tilmann Rabl (University of Potsdam)

Volker Markl (German Research Centre for Artificial Intelligence (DFKI))

Research Group

Web Information Systems

DOI related publication

https://doi.org/10.14778/3342263.3342633 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:3970f98f-bcf1-4ead-93f7-2a2b20968bf8

Publication Year

2019

Language

English

Issue number

11

Volume number

12

Pages (from-to)

1553-1567

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.

Files

End_to_end_ml_pipelines.pdf

(pdf | 0.889 Mb)