High Throughput Parallel Computation with High Bandwidth Memory on FPGA

Abstract

With the increase in available storage bandwidth, CPUs cannot keep up with the compute throughput needed to process the incoming data; GPUs and FPGAs are generally better suited for such tasks. To assist FPGAs, some boards are equipped with one or more high bandwidth memory (HBM) stacks, each providing a bandwidth of 230 GB/s. This thesis presents a hardware design for the Alveo U280 FPGA board with HBM. Each HBM stack exposes multiple interfaces to the full range of memory within HBM. Utilizing these interfaces, hardware decompressors for the Snappy compression algorithm are placed in parallel to achieve a higher end-to-end throughput. Additionally, a design is created to benchmark HBM by transporting data of varying sizes between HBM and logic within the FPGA. The hardware decompressor and the component that interfaces to HBM were found to be incompatible and required additional logic to transport data between them. To ease the parallelization of the decompressors, a custom Snappy framing format is implemented. Using this format, a softcore processor on the FPGA buffers the locations of compressed data within HBM and divides these over the available decompressors. The design is successfully synthesized into a kernel that can be loaded onto the FPGA. Measured from the moment compressed data resides in HBM until it is decompressed, a single decompressor reaches a maximum end-to-end throughput of 4.0 GB/s; with eight or more decompressors active, the design reaches a throughput between 20.0 and 26.2 GB/s. The hardware decompressor design uses less than 10% of the resources of the U280 with little power usage. Compared to a multithreaded software implementation, the hardware solution is 1.5-2.5x faster on a set of benchmark files with varying compression ratios.
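
To illustrate the dispatch scheme described above, the following is a minimal C sketch of how a softcore processor might buffer chunk locations parsed from the custom framing format and divide them over the parallel decompressors. The chunk_desc_t layout, NUM_DECOMPRESSORS, is_busy, and start_decompressor are hypothetical stand-ins for the thesis's actual framing fields and memory-mapped decompressor interface; the MMIO accesses are stubbed so the sketch runs on a host.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical descriptor for one compressed chunk in the custom
 * Snappy framing format: its location and size within HBM. */
typedef struct {
    uint64_t hbm_offset;  /* byte offset of the chunk within HBM */
    uint32_t comp_len;    /* compressed length in bytes          */
} chunk_desc_t;

#define NUM_DECOMPRESSORS 8

/* Stand-ins for the MMIO reads/writes the real softcore firmware
 * would perform; stubbed here for host-side execution. */
static int is_busy(int id) { (void)id; return 0; }

static void start_decompressor(int id, const chunk_desc_t *d)
{
    printf("decompressor %d <- offset 0x%08llx, %u bytes\n",
           id, (unsigned long long)d->hbm_offset, d->comp_len);
}

/* Buffer the chunk locations parsed from the framing format and hand
 * each chunk to the next idle decompressor, round-robin. */
static void dispatch_chunks(const chunk_desc_t *chunks, size_t n)
{
    size_t next = 0;
    while (next < n)
        for (int id = 0; id < NUM_DECOMPRESSORS && next < n; id++)
            if (!is_busy(id))
                start_decompressor(id, &chunks[next++]);
}

int main(void)
{
    /* Example chunk table as it might be parsed from the framing. */
    chunk_desc_t chunks[] = {
        { 0x00000000, 65536 },
        { 0x00010000, 48210 },
        { 0x0001c000, 70144 },
    };
    dispatch_chunks(chunks, sizeof chunks / sizeof chunks[0]);
    return 0;
}

Because each chunk carries its own HBM offset and length, chunks can be decompressed independently, which is what allows the design to scale throughput with the number of active decompressors.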