High Throughput Sorting on FPGAs using High Bandwidth Memory

Abstract

In this thesis we explore the acceleration of sorting algorithms on FPGAs using high bandwidth memory (HBM). The target application is an FPGA acting as an accelerator in an OpenCAPI-enabled system, which allows the accelerator to access the host's main memory at a bandwidth of 25 GB/s for either reads or writes. We explore under which read and write access patterns the HBM bandwidth of 460.8 GB/s can be reached and identify the specific circumstances under which this bandwidth is achieved. The sorting algorithm is implemented in hardware in two steps: partitioning and sorting. We design two partitioning architectures and one sorting architecture. The sorting architecture, based on merge sort, sorts the buckets generated in the partitioning step; it uses HBM and wide merge trees to reduce the number of passes through memory. The architectures are intended to be instantiated multiple times on the FPGA to achieve a higher sorting throughput. In simulation at 225 MHz, each architecture is designed to output up to 3.6 GB/s of 8+8 byte key-value pairs under ideal conditions. We measure the first and second partitioning architectures and identify an HBM bottleneck in the former: its strided access pattern limits throughput to 0.44 GB/s on uniformly random input, while on sorted input it reaches 2.18 GB/s. The second partitioning architecture is not affected by this and achieves a throughput of approximately 1.7 GB/s for both types of input. The sorter performs best, sorting buckets at approximately 2.7 GB/s. Synthesis results show that the target FPGA has enough resources for tens of partitioners and sorters, allowing sorting hardware to be built that saturates the system bandwidth.
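
As a rough software illustration of the two-step approach described above, the sketch below partitions 64-bit keys into buckets by their high-order bits and then sorts each bucket with a wide multi-way merge over pre-sorted chunks. The bucket count, chunk size, and use of Python's heapq.merge are illustrative assumptions, not the hardware design itself.

    import heapq
    import random

    def partition(pairs, bucket_bits=8):
        """Partition key-value pairs into buckets by the top bits of the 64-bit key.
        bucket_bits is an illustrative choice, not a value from the thesis."""
        buckets = [[] for _ in range(1 << bucket_bits)]
        for key, value in pairs:
            buckets[key >> (64 - bucket_bits)].append((key, value))
        return buckets

    def sort_bucket(bucket, chunk_size=4096):
        """Sort one bucket by merging pre-sorted chunks in a single wide merge,
        mimicking one pass of a wide merge tree."""
        chunks = [sorted(bucket[i:i + chunk_size])
                  for i in range(0, len(bucket), chunk_size)]
        return list(heapq.merge(*chunks))

    def two_step_sort(pairs):
        """Two-step sort: partition into key-range buckets, then merge-sort each
        bucket; concatenating the buckets in order yields the sorted output."""
        result = []
        for bucket in partition(pairs):
            result.extend(sort_bucket(bucket))
        return result

    if __name__ == "__main__":
        data = [(random.getrandbits(64), i) for i in range(100_000)]
        assert two_step_sort(data) == sorted(data)

Because each bucket covers a disjoint key range, the buckets can be sorted independently, which is what allows multiple partitioner and sorter instances on the FPGA to work in parallel.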