SpArrow: A Spark-Arrow Engine
Leveraging the Arrow in-memory columnar format to increase Spark efficiency in RDD computations
F. Fiorini (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jan S. Rellermeyer – Mentor (TU Delft - Data-Intensive Systems)
Z. Al-Ars – Graduation committee member (TU Delft - Computer Engineering)
J.A. Pouwelse – Coach (TU Delft - Data-Intensive Systems)
Abstract
The ever-increasing amount of data generated worldwide, combined with the business advantages of processing such data quickly and efficiently, has accelerated research and development in big data analytics. Existing solutions for storing and computing data can no longer deliver the required processing performance, and technical advances in I/O operations and networking have further widened the gap between the requirements and the available solutions. A number of changes, such as the introduction of new in-memory representations, have been proposed to reduce the overhead of existing data analytics frameworks, but they have not yet been fully integrated with such frameworks due to a lack of market traction.

Apache Spark is among the most widely used frameworks for big data analytics, as it provides, among other functionalities, an extensive and easy-to-use API for performing computations on a multitude of processing workloads. At the core of Spark sits the Resilient Distributed Dataset (RDD), an abstraction upon which other abstractions and workload-specific libraries have been built over the years. While general development efforts have focused on creating new abstractions on top of the core library, not enough attention has been paid to solving intrinsic issues of the RDDs.

This work therefore proposes an integration between Apache Spark and the Apache Arrow in-memory data format that can be leveraged to address memory and computational bottlenecks. Such an integration is beneficial for several reasons, such as the reduction of serialization and storage overheads, thanks to an efficient column-oriented data format shared among the machines in a cluster. This work proves the feasibility of the approach, introducing a number of changes to Spark and Arrow that result in a proof-of-concept implementation. Furthermore, it shows that the integration can be achieved without expensive and disruptive changes to the existing APIs, allowing existing workloads to remain fully compatible with the new, Arrow-based techniques. The performance advantages have been evaluated, resulting in an execution time speedup of approximately 14%, with the biggest improvement being a 50% reduction in execution time for wide transformations. In addition, the integration allows functionality typically executed within Spark to be offloaded to Arrow, introducing further performance improvements, with a 20% reduction in execution time for offloaded functionality compared to pure Spark.
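To give a concrete sense of the serialization savings the abstract refers to, the sketch below uses Spark's existing DataFrame-level Arrow integration (shipped since Spark 2.3 and configured as shown in Spark 3.x), in which data crosses the JVM/Python boundary as Arrow record batches rather than row-by-row pickled objects. Note that this is an illustrative example of Arrow-based columnar interchange in Spark, not the RDD-level SpArrow engine developed in this thesis.

# Minimal PySpark sketch: enabling Spark's built-in Arrow path for
# DataFrame <-> pandas transfers (distinct from the RDD-level
# integration proposed in this work).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("arrow-columnar-demo")
    # Configuration key as of Spark 3.x; Spark 2.3/2.4 used
    # "spark.sql.execution.arrow.enabled".
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")

# With the Arrow flag on, toPandas() moves data JVM -> Python as
# columnar Arrow batches, avoiding most per-row serialization overhead.
pdf = df.toPandas()
print(pdf.head())

spark.stop()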