Implementation and Evaluation of Packed-SIMD Instructions for a RISC-V Processor

Abstract

With the increase in the amount of data being gathered, the need for data processing is also rising. Alongside the proprietary ISAs that have long been prevalent, the free and open RISC-V ISA has attracted major interest. The modularity of the RISC-V ISA allows it to be extended with many instruction set extensions. One such extension that aids in processing large amounts of data is the P-extension, which introduces packed-SIMD instructions. In this thesis, the RISC-V-based open-source CVA6 processor is extended to support the SIMD instructions defined by the P-extension. To do so, the 332 instructions of the P-extension are divided into subsets based on the types of instructions used by applications that employ SIMD instructions and on the hardware needed to implement them. Due to time constraints, 268 of the 332 instructions (80.7%) were implemented. However, these include all the instructions that the benchmarks could utilize, so the benchmark results show the full performance achievable with the P-extension. These 268 instructions make up the basic, MAC 8-bit, MAC 16-bit, and MAC 32-bit subsets. The ALU has been modified to operate in a SIMD manner on eight 8-bit, four 16-bit, two 32-bit, or one 64-bit element, and it has been extended with new operations such as data-movement and reorganization instructions. Like the ALU, the multiplier has been converted into a SIMD multiplier using a SIMD Baugh-Wooley scheme, and it has additionally been extended to function as a MAC unit. The impact of the newly added SIMD instructions is evaluated both in the ideal scenario of matrix multiplication and in a real-world machine learning application. For matrix multiplication, using only instructions from the basic subset yields speedups of up to 8.8x for 8-bit and 7.2x for Q7 elements, and 4.9x for 16-bit and 3.8x for Q15 elements. When the MAC instructions are also used, the speedups increase to up to 12.3x for 8-bit and 12.6x for Q7 elements, and 6.7x for 16-bit and 6.5x for Q15 elements. The real-world benchmark is an image-recognition convolutional neural network based on the CIFAR-10 data set; here, speedups of 2.1x and 3.3x are obtained with the basic subset alone and with the MAC instructions added, respectively. However, the additional hardware comes at a cost: LUT and flip-flop usage increase by 5.0% and 0.05% for the basic subset, and by 7.2% and 0.56% for the basic and MAC subsets combined. Although the additional hardware could affect the maximum achievable clock frequency, the critical path remains in the FPU, so the maximum clock frequency is unaffected and reaches 70 MHz on a Xilinx Kintex-7 FPGA.
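
To illustrate the lane-wise behaviour described above, the sketch below is a minimal C reference model of packed 8-bit addition on a 64-bit register, the kind of operation the basic subset provides (for example, the P-extension's ADD8 instruction). The function name add8_model and the pure-software formulation are illustrative assumptions; the thesis implements this behaviour in the CVA6 ALU hardware, not in C.

    #include <stdint.h>

    /* Lane-wise 8-bit addition on a 64-bit register: eight independent
     * additions, with no carry propagating from one lane into the next. */
    static uint64_t add8_model(uint64_t rs1, uint64_t rs2)
    {
        uint64_t rd = 0;
        for (int lane = 0; lane < 8; lane++) {
            uint8_t a = (uint8_t)(rs1 >> (8 * lane));
            uint8_t b = (uint8_t)(rs2 >> (8 * lane));
            uint8_t sum = (uint8_t)(a + b); /* wraps around within the lane */
            rd |= (uint64_t)sum << (8 * lane);
        }
        return rd;
    }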
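
The extra speedup of the MAC subsets in matrix multiplication comes from instructions that fuse several multiplications and an accumulation into a single operation. The sketch below is a simplified C model of one such operation on a single 32-bit accumulator chunk, in the spirit of the P-extension's SMAQA instruction: four signed 8-bit products are summed into a 32-bit accumulator, so an inner product advances four elements per instruction. The function name smaqa_model and the single-chunk simplification are assumptions made for illustration.

    #include <stdint.h>

    /* Simplified model of a packed 8-bit multiply-accumulate: the four
     * signed 8-bit lanes of rs1 and rs2 are multiplied pairwise and the
     * products are added to a 32-bit accumulator. */
    static int32_t smaqa_model(int32_t acc, uint32_t rs1, uint32_t rs2)
    {
        for (int lane = 0; lane < 4; lane++) {
            int8_t a = (int8_t)(rs1 >> (8 * lane));
            int8_t b = (int8_t)(rs2 >> (8 * lane));
            acc += (int32_t)a * (int32_t)b;
        }
        return acc;
    }

Replacing a separate multiply and add per element with one such packed operation is why the reported matrix-multiplication speedups rise further once the MAC subsets are enabled.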