Implicit Parallelism through Deep Language Embedding

Journal Article (2016)

Author(s)

Alexander Alexandrov (Technical University of Berlin)

A. Katsifodimos (Technical University of Berlin)

Georgi Krastev (Technical University of Berlin)

Volker Markl (Technical University of Berlin)

Affiliation

External organisation

DOI related publication

https://doi.org/10.1145/2949741.2949754

To reference this document use:

https://resolver.tudelft.nl/uuid:b6ca0849-9ebf-41de-9d4d-1f5636a499ed

Publication Year

2016

Language

English

Issue number

1

Volume number

45

Pages (from-to)

51-58

Abstract

Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this programming paradigm has found its way into the core APIs of parallel dataflow engines such as Hadoop's MapReduce, Spark's RDDs, and Flink's DataSets. We review programming patterns typical of these APIs and discuss how they relate to the underlying parallel execution model. We argue that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis by improving programmer productivity. To achieve that, we first revisit the algebraic foundations of parallel collection processing. Based on that, we propose a simplified API that (i) provides proper support for nested collection processing and (ii) alleviates the need for certain second-order primitives through comprehensions, a declarative syntax akin to SQL. Finally, we present a metaprogramming pipeline that performs algebraic rewrites and physical optimizations, which allows us to target parallel dataflow engines like Spark and Flink with competitive performance.
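
As an illustration of the contrast the abstract draws between explicit second-order primitives and comprehensions, the following minimal sketch expresses the same equi-join both ways. It is not the API proposed in the article: it uses plain in-memory Python lists as a stand-in for distributed collections, and the data, the `flat_map` helper, and all variable names are hypothetical.

```python
from functools import reduce
from itertools import chain

# Hypothetical toy data (not from the article): (id, address) and (id, reason).
emails = [(1, "alice@example.org"), (2, "bob@example.org"), (3, "carol@example.org")]
blacklist = [(2, "spam"), (3, "phishing")]


def flat_map(f, xs):
    """Apply f to every element of xs and concatenate the resulting sequences."""
    return list(chain.from_iterable(map(f, xs)))


# Style 1: the join spelled out with second-order functions.
# The reader has to reconstruct the join from nested flat_map/filter/map calls.
joined_functional = flat_map(
    lambda e: map(
        lambda b: (e[1], b[1]),
        filter(lambda b: b[0] == e[0], blacklist),
    ),
    emails,
)

# Style 2: the same join as a comprehension.
# Generators, the join predicate, and the result tuple are stated declaratively,
# much like the FROM, WHERE, and SELECT clauses of a SQL query.
joined_comprehension = [
    (address, reason)
    for (email_id, address) in emails
    for (blocked_id, reason) in blacklist
    if email_id == blocked_id
]

# A fold (reduce) over the result, here simply counting the matches.
match_count = reduce(lambda acc, _: acc + 1, joined_comprehension, 0)
```

Both styles produce the same list of pairs. In the comprehension form the join structure is visible in the syntax itself, which is the kind of declarative form a compiler could, in principle, inspect and rewrite. Nothing in this sketch runs in parallel.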

No files available

Metadata only record. There are no files for this record.