High Throughput Parallel Computation with High Bandwidth Memory on FPGA

Abstract

With the increase in available storage bandwidth, CPUs cannot keep up with the compute throughput needed to process the incoming data; GPUs and FPGAs are generally better suited for such tasks. To assist FPGAs, some boards are equipped with one or more high bandwidth memory (HBM) stacks, each providing a bandwidth of 230 GB/s. This thesis presents a hardware design for the Alveo U280 FPGA board with HBM. Each HBM stack exposes multiple interfaces to the full range of memory within HBM. Utilizing these interfaces, hardware decompressors for the Snappy compression algorithm are placed in parallel to achieve a higher end-to-end throughput. Additionally, a design is created to benchmark HBM by transporting data of varying sizes between HBM and logic within the FPGA. The hardware decompressor and the component that interfaces to HBM were found to be incompatible and required additional logic to transport data between them. To ease the parallelization of the decompressors, a custom Snappy framing format is implemented. Using this format, a softcore processor on the FPGA buffers the locations of compressed data within HBM and divides these over the available decompressors. The design is successfully synthesized into a kernel that can be loaded onto the FPGA. Measured from the moment compressed data resides in HBM until it is decompressed, a single decompressor reaches a maximum end-to-end throughput of 4.0 GB/s; with eight or more decompressors active, the design reaches a throughput between 20.0 and 26.2 GB/s. The hardware decompressor design uses less than 10% of the resources of the U280 with little power usage. Compared to a multithreaded software implementation, the hardware solution is 1.5-2.5x faster on a set of benchmark files with varying compression ratios.
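
To illustrate the dispatch scheme described above, the following is a minimal C sketch of how a softcore processor might buffer chunk locations parsed from the custom framing format and divide them over the parallel decompressors. The chunk_desc_t layout, NUM_DECOMPRESSORS, is_busy, and start_decompressor are hypothetical stand-ins for the thesis's actual framing fields and memory-mapped decompressor interface; the MMIO accesses are stubbed so the sketch runs on a host.

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical descriptor for one compressed chunk in the custom
 * Snappy framing format: its location and size within HBM. */
typedef struct {
    uint64_t hbm_offset;  /* byte offset of the chunk within HBM */
    uint32_t comp_len;    /* compressed length in bytes          */
} chunk_desc_t;

#define NUM_DECOMPRESSORS 8

/* Stand-ins for the MMIO reads/writes the real softcore firmware
 * would perform; stubbed here for host-side execution. */
static int is_busy(int id) { (void)id; return 0; }

static void start_decompressor(int id, const chunk_desc_t *d)
{
    printf("decompressor %d <- offset 0x%08llx, %u bytes\n",
           id, (unsigned long long)d->hbm_offset, d->comp_len);
}

/* Buffer the chunk locations parsed from the framing format and hand
 * each chunk to the next idle decompressor, round-robin. */
static void dispatch_chunks(const chunk_desc_t *chunks, size_t n)
{
    size_t next = 0;
    while (next < n)
        for (int id = 0; id < NUM_DECOMPRESSORS && next < n; id++)
            if (!is_busy(id))
                start_decompressor(id, &chunks[next++]);
}

int main(void)
{
    /* Example chunk table as it might be parsed from the framing. */
    chunk_desc_t chunks[] = {
        { 0x00000000, 65536 },
        { 0x00010000, 48210 },
        { 0x0001c000, 70144 },
    };
    dispatch_chunks(chunks, sizeof chunks / sizeof chunks[0]);
    return 0;
}

Because each chunk carries its own HBM offset and length, chunks can be decompressed independently, which is what allows the design to scale throughput with the number of active decompressors.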