Implicit parallelism through deep language embedding
Alexander Alexandrov (Technical University of Berlin)
Andreas Kunft (Technical University of Berlin)
Asterios Katsifodimos (Technical University of Berlin)
Felix Schüler (Technical University of Berlin)
Lauritz Thamsen (Technical University of Berlin)
Odej Kao (Technical University of Berlin)
Tobias Herb (Technical University of Berlin)
Volker Markl (Technical University of Berlin)
More Info
expand_more
Abstract
The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataow APIs that are tightly coupled to their underlying runtime engine. Expressing data analysis algorithms with complex data and control ow structure using such APIs reveals a number of limitations that impede programmer's productivity. In this paper we show that the design of data analysis languages and APIs from a runtime engine point of view bloats the APIs with low-level primitives and affects programmer's productivity. Instead, we argue that an approach based on deeply embedding the APIs in a host language can address the shortcomings of current data analysis languages. To demonstrate this, we propose a language for complex data analysis embedded in Scala, which (i) allows for declarative specification of dataows and (ii) hides the notion of dataparallelism and distributed runtime behind a suitable intermediate representation. We describe a compiler pipeline that facilitates efficient data-parallel processing without imposing runtime engine-bound syntactic or semantic restrictions on the structure of the input programs. We present a series of experiments with two state-of-the-art systems that demonstrate the optimization potential of our approach.