Optimizing Memory Mapping for Dataflow Inference Accelerators

Efficient Memory Utilization on FPGAs


Abstract

Convolutional Neural Network (CNN) inference has gained significant traction for performing tasks like speech recognition and image classification. To improve the accuracy with which these tasks can be performed, CNNs are typically designed to be deep, encompassing a large number of neural network layers. As a result, the computational intensity and storage requirements increase dramatically, necessitating hardware acceleration to reduce the execution latency. Field-Programmable Gate Arrays (FPGAs) in particular are well suited for hardware acceleration of CNN inference, since the underlying hardware can be tailored to perform the required operations rapidly and efficiently. To this end, Xilinx introduced the FINN framework for fast, scalable quantized Neural Network (NN) inference on FPGAs. The FINN end-to-end deep learning framework converts high-level descriptions of CNN models into fast and scalable FPGA inference accelerator designs based on a custom dataflow architecture. In this dataflow architecture, the input data are streamed in a feed-forward fashion through a pipeline of per-layer dedicated compute units, each of which has on-chip access to the associated NN parameters. To keep the compute units occupied, the memory subsystem has to satisfy specific throughput requirements, and these requirements directly dictate the shapes of the on-chip buffers that hold the NN parameter values. Especially for accelerators that exploit a high degree of parallelism, these memory shapes map poorly to the available on-chip memory resources of FPGA devices. As a result, these resources are typically underutilized, which leads to an On-Chip Memory (OCM) deficiency and limits the amount of parallelism that can be exploited. In this thesis, a methodology is proposed that improves the mapping efficiency of NN parameter buffers to the embedded Block RAM (BRAM) resources on FPGAs without negatively impacting the accelerator throughput. To accomplish this, an architecture is proposed in which the memory subsystem and compute units are decoupled and operate as a producer-consumer system. Within this architecture, the memory subsystem runs at a higher clock frequency than the compute units, which enables it to match the consumption rate of the compute units even when multiple NN parameter buffers are clustered within the same BRAM instance. Furthermore, a genetic algorithm is used to find optimal group arrangements for these clusters, such that the mapping efficiency is improved while the throughput requirements are still met. The proposed methodology has been applied to a number of CNN accelerators and demonstrates BRAM reductions of up to 30%. The observed BRAM reductions enable existing FINN accelerator designs to be ported to smaller FPGA devices while maintaining the computational throughput.
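
To illustrate the clock-ratio clustering idea described above, the following is a minimal sketch, not taken from the thesis: it assumes a simplified 18Kb BRAM configured as 1024 x 18 bits, hypothetical per-layer buffer shapes, and a memory subsystem that runs an integer factor faster than the compute units so that it can time-multiplex that many buffer reads per compute cycle.

```python
# Minimal sketch with assumed, simplified numbers: estimate how clustering
# several parameter buffers into one BRAM reduces the BRAM count, provided
# the memory subsystem clock is fast enough to serve every buffer in the
# cluster once per compute cycle.
import math

BRAM_DEPTH = 1024   # assumed 18Kb BRAM configured as 1024 x 18 bits
BRAM_WIDTH = 18

def brams_for(depth, width):
    """BRAMs needed when a buffer gets its own BRAM-backed memory."""
    return math.ceil(depth / BRAM_DEPTH) * math.ceil(width / BRAM_WIDTH)

def brams_clustered(buffers, clock_ratio):
    """BRAMs needed when up to `clock_ratio` buffers share one BRAM.

    The memory subsystem runs `clock_ratio` times faster than the compute
    units, so it can time-multiplex that many reads per compute cycle.
    Clustered buffers are stacked along the depth dimension.
    """
    total = 0
    for i in range(0, len(buffers), clock_ratio):
        cluster = buffers[i:i + clock_ratio]
        depth = sum(d for d, _ in cluster)      # stacked depths
        width = max(w for _, w in cluster)      # widest member sets the width
        total += brams_for(depth, width)
    return total

# Hypothetical per-layer weight buffers: (depth, width in bits)
buffers = [(200, 8), (150, 8), (300, 4), (100, 4)]

naive = sum(brams_for(d, w) for d, w in buffers)
clustered = brams_clustered(buffers, clock_ratio=2)
print(f"naive: {naive} BRAMs, clustered: {clustered} BRAMs")
```

With these assumed numbers, the four buffers each occupy a mostly empty BRAM when mapped one-to-one (4 BRAMs in total), whereas clustering pairs of buffers at a 2x memory-to-compute clock ratio fits them into 2 BRAMs without lowering the number of reads delivered per compute cycle. The thesis's genetic algorithm addresses the harder version of this problem, choosing which buffers to group together so that the overall mapping efficiency improves while the throughput requirements remain satisfied.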