Customizable Register Files for Multidimensional SIMD Architectures


Abstract

Processor clock frequencies and the related performance improvements have recently stagnated due to severe power and thermal dissipation barriers. As a result, the additional transistors provided by new technology generations are used to place more processing elements on a chip and to specialize them for power efficiency. Single Instruction Multiple Data (SIMD) accelerators for data-parallel workloads are a good example. SIMD processors, however, are notorious for turning performance programmers into low-level hardware experts. Moreover, legacy programs often require rework to follow (micro)architectural evolutions. This dissertation addresses the problems of SIMD accelerator programmability, code portability, and performance-efficient data management. The proposed Polymorphic Register File (PRF) provides a simple programming interface, allowing programmers to focus on algorithm optimizations rather than complex data transformations or low-level details. The overall PRF size is fixed, while the number, dimensions, and sizes of its individual registers can be readjusted at runtime. Once the registers are defined, the microarchitecture takes care of the data management. We base our proposal on a 2D addressable multi-banked parallel storage that simultaneously delivers multiple data elements for a set of predetermined access patterns. For each pattern, we define a Module Assignment Function (MAF) and a customized addressing function. We propose four MAF sets that fully cover practical access patterns and evaluate them in a technology-independent way. Next, we study a multi-lane, multi-port design and its HDL implementation. Clock frequencies of 100 to 300 MHz for FPGA and 500 to 900+ MHz for ASIC synthesis strongly indicate the practical usability of our PRF. For representative matrix computation workloads, single-core experiments suggest that our approach outperforms the Cell SIMD engine by up to three times. 
Furthermore, the number of executed instructions is reduced by up to three orders of magnitude compared to the Cell scalar core, depending on the vector register size. Finally, we vectorize a separable 2D convolution algorithm for our PRF to fully avoid strided memory accesses, outperforming a state-of-the-art NVIDIA GPU in throughput for mask sizes of 9 × 9 elements and larger.
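
The role of a Module Assignment Function and its addressing function can be illustrated with a small sketch. It is not taken from the dissertation: the function names, the 2 × 4 bank arrangement, and the simple modulo-based mapping are assumptions chosen for illustration. The mapping places element (i, j) of a 2D array in bank (i mod p, j mod q), which lets any aligned or unaligned p × q rectangle be read without two elements landing in the same bank.

```python
from itertools import product


def maf_rectangle(i, j, p, q):
    """Hypothetical module assignment: bank coordinates for element (i, j)
    in a p x q grid of memory banks."""
    return (i % p, j % q)


def intra_bank_address(i, j, p, q, n_cols):
    """Hypothetical addressing function: position of element (i, j) inside
    its bank. Assumes n_cols is a multiple of q."""
    return (i // p) * (n_cols // q) + (j // q)


def rectangle_is_conflict_free(top, left, p, q):
    """True if the p x q block whose corner is (top, left) touches
    every bank exactly once."""
    banks = {
        maf_rectangle(top + di, left + dj, p, q)
        for di, dj in product(range(p), range(q))
    }
    return len(banks) == p * q


def row_is_conflict_free(row, left, p, q):
    """True if p*q consecutive elements of one row touch every bank
    exactly once under the same mapping."""
    banks = {maf_rectangle(row, left + k, p, q) for k in range(p * q)}
    return len(banks) == p * q


if __name__ == "__main__":
    p, q = 2, 4  # assumed bank arrangement: 2 x 4 = 8 banks
    print(all(rectangle_is_conflict_free(t, l, p, q)
              for t in range(8) for l in range(8)))
    print(row_is_conflict_free(0, 0, p, q))
```

Under this mapping, rectangular blocks are conflict-free, whereas a row of p · q consecutive elements only reaches q distinct banks. This suggests why a single mapping does not suffice and several MAF sets, each tailored to a group of access patterns, are needed.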