Bridging the Gap

Towards optimization across linear and relational algebra

Conference Paper (2016)
Author(s)

Andreas Kunft (Technical University of Berlin)

Alexander Alexandrov (Technical University of Berlin)

Asterios Katsifodimos (Technical University of Berlin)

Volker Markl (Technical University of Berlin)

Affiliation
External organisation
DOI related publication
https://doi.org/10.1145/2926534.2926540 (final published version)
Publication Year
2016
Language
English
Article number
a1
ISBN (print)
9781450343114
Event
3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2016, co-located with ACM SIGMOD 2016 (2016-06-26 - 2016-07-01), San Francisco, United States

Abstract

Advanced data analysis typically requires some form of preprocessing to extract and transform data before it is processed with machine learning and statistical analysis techniques. Preprocessing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis using multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also rules out optimizations, such as sharing physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program. The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a language deeply embedded in Scala that enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) that enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
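To make the idea of a single program spanning both algebras more concrete, the following is a minimal, single-machine sketch in plain Scala. It is not Lara's actual API: the types only borrow the names DataBag and Matrix from the abstract, and every method, the Reading record, and the Pipeline object are hypothetical stand-ins chosen for illustration. It shows a relational preprocessing step (filter and projection, written as a for-comprehension) feeding directly into a linear algebra step (the Gram matrix X^T X) within one program.

```scala
// Toy, in-memory stand-ins for the two abstract data types named in the abstract.
// ASSUMPTION: none of these signatures are taken from Lara; all names are hypothetical.
final case class DataBag[A](items: Seq[A]) {
  def map[B](f: A => B): DataBag[B] = DataBag(items.map(f))
  // Defining withFilter lets guards be used in for-comprehensions over a DataBag.
  def withFilter(p: A => Boolean): DataBag[A] = DataBag(items.filter(p))
}

final case class Matrix(rows: Vector[Vector[Double]]) {
  def numRows: Int = rows.length
  def numCols: Int = if (rows.isEmpty) 0 else rows.head.length
  def transpose: Matrix = Matrix(rows.transpose)

  // Naive dense matrix product; enough for a small illustration.
  def *(that: Matrix): Matrix = {
    require(numCols == that.numRows, "inner dimensions must match")
    val cols = that.rows.transpose
    Matrix(rows.map(r => cols.map(c => r.zip(c).map { case (a, b) => a * b }.sum)))
  }
}

object Matrix {
  // Hypothetical bridge from the relational side to the linear algebra side.
  def fromBag(bag: DataBag[Vector[Double]]): Matrix = Matrix(bag.items.toVector)
}

object Pipeline {
  // Hypothetical input record used only for this example.
  final case class Reading(sensor: String, x: Double, y: Double)

  // Relational preprocessing (drop rows with missing values, project two columns)
  // followed by a linear algebra step (X^T X) in the same program.
  def gram(readings: DataBag[Reading]): Matrix = {
    val features = for {
      r <- readings
      if !r.x.isNaN && !r.y.isNaN
    } yield Vector(r.x, r.y)
    val x = Matrix.fromBag(features)
    x.transpose * x
  }
}
```

Because both halves live in one program, a compiler that sees the whole of `gram` could, in principle, choose a data layout for `features` that suits the subsequent matrix product. The sketch itself performs no such optimization; it only illustrates the program shape that the abstract describes.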