High Throughput Sorting on FPGAs using High Bandwidth Memory

Abstract

In this thesis we explore the acceleration of sorting algorithms on FPGAs using high bandwidth memory (HBM). The target application is an FPGA acting as an accelerator in an OpenCAPI-enabled system, which allows the accelerator to access the host's main memory at a bandwidth of 25 GB/s for either reads or writes. We explore under which read and write access patterns the HBM bandwidth of 460.8 GB/s can be reached and identify the specific circumstances under which this bandwidth is achieved. The sorting algorithm is implemented in hardware in two steps: partitioning and sorting. We design two partitioning architectures and one sorting architecture. The sorting architecture, based on merge sort, sorts the buckets generated in the partitioning step; it uses HBM and wide merge trees to reduce the number of passes through memory. The architectures are intended to be instantiated multiple times on the FPGA to achieve a higher sorting throughput. In simulation at 225 MHz, each architecture is designed to output up to 3.6 GB/s of 8+8 byte key-value pairs under ideal conditions. We measure the first and second partitioning architectures and identify an HBM bottleneck in the former: its strided access pattern limits throughput to 0.44 GB/s on uniformly random input, while on sorted input it reaches 2.18 GB/s. The second partitioning architecture is not affected by this and achieves a throughput of approximately 1.7 GB/s for both types of input. The sorter performs best, sorting buckets at approximately 2.7 GB/s. Synthesis results show that the target FPGA has enough resources for tens of partitioners and sorters, allowing sorting hardware to be built that saturates the system bandwidth.
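
As a rough software illustration of the two-step approach described above, the sketch below partitions 64-bit keys into buckets by their high-order bits and then sorts each bucket with a wide multi-way merge over pre-sorted chunks. The bucket count, chunk size, and use of Python's heapq.merge are illustrative assumptions, not the hardware design itself.

    import heapq
    import random

    def partition(pairs, bucket_bits=8):
        """Partition key-value pairs into buckets by the top bits of the 64-bit key.
        bucket_bits is an illustrative choice, not a value from the thesis."""
        buckets = [[] for _ in range(1 << bucket_bits)]
        for key, value in pairs:
            buckets[key >> (64 - bucket_bits)].append((key, value))
        return buckets

    def sort_bucket(bucket, chunk_size=4096):
        """Sort one bucket by merging pre-sorted chunks in a single wide merge,
        mimicking one pass of a wide merge tree."""
        chunks = [sorted(bucket[i:i + chunk_size])
                  for i in range(0, len(bucket), chunk_size)]
        return list(heapq.merge(*chunks))

    def two_step_sort(pairs):
        """Two-step sort: partition into key-range buckets, then merge-sort each
        bucket; concatenating the buckets in order yields the sorted output."""
        result = []
        for bucket in partition(pairs):
            result.extend(sort_bucket(bucket))
        return result

    if __name__ == "__main__":
        data = [(random.getrandbits(64), i) for i in range(100_000)]
        assert two_step_sort(data) == sorted(data)

Because each bucket covers a disjoint key range, the buckets can be sorted independently, which is what allows multiple partitioner and sorter instances on the FPGA to work in parallel.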