Transparently Accelerating Spark SQL Code on Computing Hardware

Master Thesis (2020)
Author(s)

F.M. Nonnenmacher (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Z. Al-Ars – Mentor (TU Delft - Computer Engineering)

Peter Hofstee – Graduation committee member (TU Delft - Computer Engineering)

C. Hauff – Graduation committee member (TU Delft - Web Information Systems)

J.J. Hoozemans – Graduation committee member (TU Delft - Computer Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 F.M. Nonnenmacher
Publication Year
2020
Language
English
Graduation Date
19-08-2020
Awarding Institution
Delft University of Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Through new digital business models, the importance of big data analytics continuously grows. Initially, data analytics clusters were mainly bound by the throughput of network links and the performance of I/O operations. With recent hardware developments this has changed, and the performance of CPUs and memory access has often become the new limiting factor. Heterogeneous computing systems, which combine CPUs with other computing hardware such as GPUs and FPGAs, try to overcome this limit by offloading computational work to the most suitable hardware.

Accelerating computations by offloading work to specialized computing hardware often requires expert knowledge and considerable effort. In contrast, Apache Spark has become one of the most widely used data analytics tools, among other reasons because of its user-friendly API. Notably, its Spark SQL component allows users to define declarative queries without having to write any code. The present work investigates how to reduce this gap and elaborates on how Spark SQL's internal information can be used to offload computations without requiring any further Spark configuration by the user.

To this end, the present work uses the Apache Arrow in-memory format to exchange data efficiently between different accelerators. It evaluates Spark SQL's extensibility for providing custom acceleration and its new columnar processing functionality, including its compatibility with the Apache Arrow format. Furthermore, the present work demonstrates the technical feasibility of such acceleration with a proof-of-concept implementation that integrates Spark with tools from the Arrow ecosystem, such as Gandiva and Fletcher. Gandiva uses the SIMD capabilities of modern CPUs to accelerate computations, while Fletcher enables the execution of FPGA-accelerated computations. Finally, the present work demonstrates that, even for simple computations, integrating these accelerators leads to significant performance improvements: with Gandiva the computation became 1.27 times faster, and with Fletcher even up to 13 times faster.

Files

Nonnenmacher_MSc_Thesis.pdf
(pdf | 2.76 MB)
License info not available