Integration of a convolutional neural network for speech-to-text recognition in an FPGA compiler flow

Abstract

Deep Neural Networks (DNNs) have grown significantly in size over the past decade. Partly as a result, their accuracy on image classification and speech recognition tasks has improved as well, giving such models great potential for real-world applications. However, their size drives compute and power requirements that are often too large for deployment on edge devices, ruling these models out for a rich field of applications that demand high-throughput, real-time execution.

Deploying quantized DNNs on Field-Programmable Gate Arrays (FPGAs) overcomes this problem. FPGAs are well known for their low-latency, high-throughput, and low-energy capabilities. However, creating hand-tuned FPGA designs requires expert-level knowledge of the underlying hardware. For mathematicians and software engineers who develop new quantized DNNs, but also for experienced hardware designers who want to implement a large DNN on an FPGA, the implementation burden is often too large to reap any practical benefit from accelerating the application on an FPGA.

The open-source FINN compiler, introduced by Xilinx Research Labs, provides an excellent bridge between the software and hardware domains by generating quantized-DNN inference accelerators for FPGAs from a high-level description of the network in the widely adopted open-source ONNX format. Since lower-level implementation details are abstracted away, an open question is how this abstraction affects the performance of the generated accelerator.
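
To give a concrete impression of this flow, the sketch below shows how an accelerator build can be driven through FINN's dataflow builder API. It is a minimal illustration assuming a recent FINN release; the model file name, FPGA part, clock period, and output directory are illustrative placeholders, not the exact settings used in this work.

# Minimal sketch: generating an FPGA dataflow accelerator with FINN
# from a quantized network in ONNX form (e.g. exported from Brevitas).
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg

# Build configuration: output location, target clock, target device,
# and which artifacts FINN should produce.
cfg = build_cfg.DataflowBuildConfig(
    output_dir="build_quartznet",          # placeholder output directory
    synth_clk_period_ns=5.0,               # 200 MHz target clock (example)
    fpga_part="xcu250-figd2104-2L-e",      # example Alveo U250 part
    generate_outputs=[
        build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
        build_cfg.DataflowOutputType.STITCHED_IP,
    ],
)

# Input: the quantized network as an ONNX file (placeholder name).
build.build_dataflow_cfg("quartznet_quantized.onnx", cfg)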

This work examines whether FPGA implementations of CNN-based speech-to-text inference models can be generated automatically by means of FINN. For this purpose, QuartzNet, a sub-state-of-the-art CNN for speech-to-text recognition, is targeted for FPGA acceleration.
To achieve this, extensions to the FINN compiler are proposed that enable the generation of 1D CNN inference accelerators for FPGAs. Furthermore, a proof-of-concept FPGA accelerator for a quantized QuartzNet model is implemented by means of FINN. Compared to a high-end CPU, the proposed FPGA accelerator achieves 7.7x higher throughput and 8.2x lower latency on a speech recognition inference task. Compared to a high-end GPU, it improves energy efficiency by 6.8% at the expense of lower throughput and higher latency.
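
For reference, QuartzNet is built from 1D time-channel separable convolutions: a depthwise convolution over time followed by a pointwise (1x1) convolution, with batch normalization and ReLU. The PyTorch sketch below illustrates this building block; the channel counts, kernel size, and sequence length are placeholders rather than the exact QuartzNet configuration. Layers of this 1D shape are what the proposed FINN extensions must map to FPGA hardware.

# Illustrative sketch of QuartzNet's basic building block: a 1D
# time-channel separable convolution (depthwise over time, then
# pointwise across channels), followed by batch norm and ReLU.
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        # Depthwise: one filter per channel, sliding along the time axis.
        self.depthwise = nn.Conv1d(
            in_ch, in_ch, kernel_size,
            padding=kernel_size // 2, groups=in_ch, bias=False,
        )
        # Pointwise: 1x1 convolution mixing information across channels.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm1d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example usage with placeholder sizes: 256 channels, kernel length 33,
# applied to 400 feature frames (e.g. a mel-spectrogram).
block = TimeChannelSeparableConv1d(in_ch=256, out_ch=256, kernel_size=33)
features = torch.randn(1, 256, 400)
out = block(features)  # shape (1, 256, 400)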

By generating an FPGA accelerator for a quantized version of QuartzNet, this work bridges the software and hardware domains, showcasing how a trained CNN from the software domain can be transformed into a high-throughput, low-latency, and energy-efficient FPGA accelerator at a fraction of the design effort required for a handwritten RTL implementation.