Complement-based Stochastic Computing Multiplier Design for Convolutional Neural Network Acceleration
J.J. Hejderup (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Sorin D. Cotofana – Mentor (TU Delft - Computer Engineering)
S Wong – Graduation committee member (TU Delft - Computer Engineering)
René Leuken – Graduation committee member (TU Delft - Signal Processing Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Convolutional Neural Networks (CNNs) have recently become popular in embedded and portable devices, owing to their high accuracy in Computer Vision (CV) tasks. However, CNNs are computationally intensive because of the convolutional layer, which accounts for over 90% of the operations. To overcome this problem, many researchers have developed parallel and customised accelerators. The methods used in these accelerators range from bit optimisation and fixed-point arithmetic to reducing the size of the network. Some researchers have also explored alternative computing paradigms such as Stochastic Computing (SC). The great advantage of SC is its ability to perform complex arithmetic with simple hardware. Its major drawback is the trade-off between latency and accuracy. There have been several attempts to mitigate this trade-off, ranging from improved stochastic number generation to parallel bitstreams and early termination. This thesis proposes StoHej, a new SC multiplier design that combines stochastic bitstreams with complementary events. The multiplier has two inputs: the neural network feature value and the weight value. The weight value determines how many iterations the computation requires. If the weight value is greater than or equal to 0.5, the complement of the event is used instead, since the complement yields a smaller number and therefore fewer iterations. As a result, the worst-case latency is reduced from O(N) to O(N/2). The proposed multiplier was compared with a Conventional Stochastic Computer (CSC) multiplier and with the BISC-MVM multiplier, the state-of-the-art SC multiplier that uses an early termination mechanism. All multipliers were first tested in a general context in a software simulation that measured accuracy and latency.
The results from these simulations showed a 3.2x speedup for the proposed design compared to BISC-MVM, with no increase in computational errors. StoHej and BISC-MVM were then tested in a CNN inference application with the MNIST dataset. The multipliers were used in a Multiply-Accumulate (MAC) array implemented on an FPGA. The results from this experiment show that StoHej achieved a 1.7x speedup with no loss in accuracy compared to BISC-MVM. StoHej's energy consumption was 40% lower than that of BISC-MVM, and its Area-Delay Product (ADP) was 30% smaller. StoHej's Area-Delay-Energy Product was 2.3x smaller than that of the BISC-MVM multiplier.
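The complement idea described in the abstract can be illustrated with a small software model. The sketch below is not the StoHej hardware design. It assumes unipolar values in [0, 1] quantised to N levels, an evenly spread deterministic bitstream for the feature value, and a counter that accumulates the stream for as many cycles as the weight dictates. The names `bitstream`, `early_terminated_count` and `complement_multiply` are hypothetical. When the weight is at least 0.5, the model accumulates for the complement weight 1 - w and recovers the product as x - x(1 - w), so no multiplication needs more than N/2 cycles.

```python
def bitstream(x_int, n):
    """Yield n bits containing x_int ones, spread evenly over the stream.

    Assumption: this stands in for a low-discrepancy stochastic number
    generator; the thesis may use a different generator.
    """
    for i in range(n):
        yield ((i + 1) * x_int) // n - (i * x_int) // n


def early_terminated_count(x_int, cycles, n):
    """Count the ones in the first `cycles` bits of the stream for x_int/n."""
    count = 0
    for i, bit in enumerate(bitstream(x_int, n)):
        if i >= cycles:
            break  # early termination: the weight sets the cycle budget
        count += bit
    return count


def complement_multiply(x_int, w_int, n):
    """Approximate (x_int/n) * (w_int/n); return (product, cycles_used)."""
    if 2 * w_int >= n:
        # Weight >= 0.5: run for the shorter complement weight (1 - w)
        # and use x*w = x - x*(1 - w).
        cycles = n - w_int
        count = x_int - early_terminated_count(x_int, cycles, n)
    else:
        cycles = w_int
        count = early_terminated_count(x_int, cycles, n)
    return count / n, cycles


if __name__ == "__main__":
    # x = 0.5 and w = 0.75 at N = 256: the complement branch is taken.
    print(complement_multiply(128, 192, 256))
```

In this model a weight of 0.75 at N = 256 takes 64 cycles where a counter loaded directly with the weight would take 192. The speedup and accuracy figures reported in the abstract come from the thesis's own simulations and FPGA experiments, not from this sketch.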