Title: Transparently Accelerating Spark SQL Code on Computing Hardware
Author: Nonnenmacher, F.M. (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Al-Ars, Z. (mentor); Hofstee, H.P. (graduation committee); Hauff, C. (graduation committee); Hoozemans, J.J. (graduation committee)
Degree granting institution: Delft University of Technology
Date: 2020-08-19

Abstract: With new digital business models, the importance of big data analytics grows continuously. Initially, data analytics clusters were mainly bound by the throughput of network links and the performance of I/O operations. With current hardware development, this has changed, and the performance of CPUs and memory access has often become the new limiting factor. Heterogeneous computing systems, consisting of CPUs and other computing hardware such as GPUs and FPGAs, try to overcome this by offloading the computational work to the most suitable hardware.
Accelerating computation by offloading work to special computing hardware often requires specialized knowledge and extensive effort. In contrast, Apache Spark became one of the most used data analytics tools, among other reasons, because of its user-friendly API. Notably, the Spark SQL component allows defining declarative queries without having to write any code. The present work investigates how to reduce this gap and elaborates on how Spark SQL's internal information can be used to offload computations without the user having to configure Spark further.
To this end, the present work uses the Apache Arrow in-memory format to exchange data efficiently between different accelerators. It evaluates Spark SQL's extensibility for providing custom acceleration and its new columnar processing function, including its compatibility with the Apache Arrow format.
Furthermore, the present work demonstrates the technical feasibility of such an acceleration by providing a proof-of-concept implementation, which integrates Spark with tools from the Arrow ecosystem, such as Gandiva and Fletcher. Gandiva uses the SIMD capabilities of modern CPUs to accelerate computations, and Fletcher allows the execution of FPGA-accelerated computations. Finally, the present work demonstrates that, even for simple computations, integrating these accelerators leads to significant performance improvements. With Gandiva, the computation became 1.27 times faster, and with Fletcher up to 13 times faster.

Subject: FPGA; Apache Arrow; Apache Spark; heterogeneous computing; Fletcher; Spark SQL
To reference this document use: http://resolver.tudelft.nl/uuid:f588ca1d-e4ae-4bf4-96ed-221d483b559d
Part of collection: Student theses
Document type: master thesis
Rights: © 2020 F.M. Nonnenmacher
Files: Nonnenmacher_MSc_Thesis.pdf (PDF, 2.76 MB)