FPGAs in Big Data

On the transparent and efficient acceleration of big data frameworks

Master Thesis (2021)
Author(s)

B.D.A. Luppes (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Z. Al-Ars – Mentor (TU Delft - Computer Engineering)

M. Brobbel – Graduation committee member (Teratide)

Asterios Katsifodimos – Coach (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2021 Bob Luppes
Publication Year
2021
Language
English
Graduation Date
28-07-2021
Awarding Institution
Delft University of Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The increasing volume and latency requirements of big data impose challenges on the processing capacity of existing computing systems. FPGA accelerators can be leveraged to overcome these challenges, but questions remain as to how these accelerators are best deployed to accelerate big data frameworks. This work investigates how future big data frameworks should be designed to facilitate transparent and efficient integration with FPGA accelerators in the context of SQL workloads. Three big data frameworks are accelerated to answer this question. These implementations offload the evaluation of a regular expression filter to Tidre, an FPGA-based regular expression matcher. First, Dremio is accelerated to obtain a 1750x speedup of the filter operator. Second, Dask is accelerated, achieving a 92x speedup for the same operator. Lastly, an accelerated version of Dask distributed is implemented. This implementation is deployed in a cluster environment to attain an end-to-end speedup of 3.6x as well as a 23% reduction of the total cost per query. It is identified that query exploration phases could be used to increase the impact of existing FPGA kernels. In addition, it is found that the batch size of batch-processing frameworks has a significant impact on the performance of FPGA-accelerated big data frameworks in distributed setups. Tuning this batch size can increase end-to-end query throughput by up to 2.1x. Finally, future big data frameworks should make use of hardware-friendly and language-independent in-memory formats such as Apache Arrow. Correct alignment of the memory buffers of this data format can further increase the throughput of the system by 2.5x.
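The batch-size finding above can be made concrete with a small sketch. The thesis offloads the regular expression filter to the Tidre FPGA matcher; the stand-in below runs the same operator shape on the CPU with Python's stdlib `re`, where each batch stands for one unit of work that would be shipped to the accelerator. The function name, the sample rows, and the pattern are illustrative assumptions, not part of the thesis; the point is only that `batch_size` sets the granularity at which data is transferred and processed, which is the knob the thesis reports tuning for up to 2.1x throughput.

```python
import re

def filter_in_batches(records, pattern, batch_size):
    """Apply a regex filter to records in fixed-size batches.

    Hypothetical CPU stand-in for the offloaded filter operator:
    each batch would be handed to the accelerator as one unit, so
    batch_size controls the transfer/compute granularity.
    """
    matcher = re.compile(pattern)
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        yield [r for r in batch if matcher.search(r)]

# Illustrative data; a larger batch_size means fewer, bigger offloads.
rows = ["error: disk full", "ok", "error: timeout", "ready"]
matched = [r for b in filter_in_batches(rows, r"^error:", 2) for r in b]
# matched == ["error: disk full", "error: timeout"]
```

In a distributed deployment such as the accelerated Dask distributed setup described above, this granularity trades per-batch overhead (transfers, scheduling) against pipeline parallelism, which is why the thesis finds tuning it worthwhile.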
