A Hardware/Software Co-designed Partitioning Algorithm of Sparse Matrix Vector Multiplication into Multiple Independent Streams for Parallel Processing

Abstract

The drive to compute faster and more efficiently has shaped the computing industry since its beginning. However, it is increasingly difficult to continue this trend because current CMOS technology can no longer be scaled down due to physical limits. Consequently, to obtain the next major performance improvement, the focus is shifting from a technology-only optimization effort towards a system-level hardware-software co-design strategy. In recent years, the move to heterogeneous computing has gained enormous traction, with major vendors such as Intel, IBM, and NVIDIA investing heavily in this approach. This paradigm shift is characterized by traditional general-purpose processors offloading data to hardware accelerators, which can exploit parallelism to a significantly higher degree. One accelerator that has existed for decades but has recently risen to greater prominence is the field-programmable gate array (FPGA). The scientific computing community also needs more computational power as its problem sizes increase. FPGAs are a promising candidate because they can map complex algorithms onto specialized hardware circuits. A key algorithm to accelerate in this domain is Sparse Matrix Vector Multiplication (SpMV). Few HLS (High-Level Synthesis) designs exist for this kernel, and the one designed using Vivado HLS exhibits significantly lower performance than the state of the art. We argue that the most effective way to achieve speedup is to implement multiple parallel pipelines so that multiple result values are produced in each cycle. Consequently, we develop an implementation-agnostic partitioning algorithm for SpMV that splits the problem into independent streams. The HLS kernel performs well as a standalone unit, offering a speedup of up to 150x compared to the ARM coprocessor on the ZYNQ system and up to 4.6x compared to state-of-the-art Vivado HLS-based solutions. Our estimations show that the solution scales with an increasing number of resources.
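
To illustrate what splitting SpMV into independent streams can look like, the following is a minimal Python sketch. It is not the partitioning algorithm developed in the thesis. It assumes a matrix stored in CSR form and uses a simple contiguous row split that tries to balance the number of nonzeros per stream. The names `spmv_csr`, `partition_rows`, and `spmv_streams` are hypothetical and chosen for this example only.

```python
from typing import List, Sequence, Tuple


def spmv_csr(indptr: Sequence[int], indices: Sequence[int],
             data: Sequence[float], x: Sequence[float],
             row_begin: int = 0, row_end: int = None) -> List[float]:
    """Compute y[i] = sum_j A[i, j] * x[j] for rows in [row_begin, row_end)."""
    if row_end is None:
        row_end = len(indptr) - 1
    y = []
    for i in range(row_begin, row_end):
        acc = 0.0
        # Nonzeros of row i are stored at positions indptr[i] .. indptr[i+1]-1.
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def partition_rows(indptr: Sequence[int],
                   num_streams: int) -> List[Tuple[int, int]]:
    """Split rows into contiguous blocks with roughly equal nonzero counts.

    Assumption for illustration: a greedy prefix split, not the thesis method.
    """
    n_rows = len(indptr) - 1
    nnz = indptr[-1]
    bounds = [0]
    for s in range(1, num_streams):
        target = nnz * s / num_streams
        r = bounds[-1]
        # Advance while the next row still fits under the cumulative target.
        while r < n_rows and indptr[r + 1] <= target:
            r += 1
        bounds.append(r)
    bounds.append(n_rows)
    return [(bounds[s], bounds[s + 1]) for s in range(num_streams)]


def spmv_streams(indptr: Sequence[int], indices: Sequence[int],
                 data: Sequence[float], x: Sequence[float],
                 num_streams: int) -> List[float]:
    """Run each row block as its own stream and concatenate the results.

    The blocks share no output rows, so each could feed a separate pipeline.
    Here they run sequentially to keep the sketch simple.
    """
    y = []
    for begin, end in partition_rows(indptr, num_streams):
        y.extend(spmv_csr(indptr, indices, data, x, begin, end))
    return y
```

Because each stream writes a disjoint range of the result vector, the streams need no synchronization with one another. In a hardware setting, each block would correspond to one pipeline, and balancing nonzeros per block is one plausible way to keep the pipelines equally busy.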

Files

Bjorn_thesis.pdf
(.pdf | 2.94 Mb)
- Embargo expired on 07-11-2019