Implicit parallelism through deep language embedding

None, None; None, None; None, None; None, None; None, None; None, None; None, None; None, None

Implicit parallelism through deep language embedding

Conference Paper (2015)

Author(s)

Alexander Alexandrov (Technical University of Berlin)

Andreas Kunft (Technical University of Berlin)

Asterios Katsifodimos (Technical University of Berlin)

Felix Schüler (Technical University of Berlin)

Lauritz Thamsen (Technical University of Berlin)

Odej Kao (Technical University of Berlin)

Tobias Herb (Technical University of Berlin)

Volker Markl (Technical University of Berlin)

Affiliation

External organisation

DOI related publication

https://doi.org/10.1145/2723372.2750543 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:23ce0640-95a2-4deb-8591-a3b112fad4d3

More Info

expand_more

Publication Year

2015

Language

English

Affiliation

External organisation

Pages (from-to)

47-61

ISBN (print)

978-1-4503-2758-9

Event

ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 (2015-05-31 - 2015-06-04), Melbourne, Australia

Downloads counter

196

Abstract

The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control ow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataows and (ii) hides the notion of dataparallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.