Exploring Convolutional Neural Networks on the ρ-VEX architecture

Abstract

As machine learning algorithms play an ever-increasing role in today's technology, greater demands are placed on computational hardware to run these algorithms efficiently. In recent years, Convolutional Neural Networks (CNNs) have become an important part of machine learning applications in areas such as object recognition and detection. In this thesis, we explore how CNNs can be implemented on the ρ-VEX processor and what can be done to optimize their performance.

The ρ-VEX processor is a VLIW processor developed at the Delft University of Technology that can be reconfigured at runtime to take advantage of Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP) in an application. In this work, we developed a streaming pipeline in a simulator, consisting of multiple 8-issue ρ-VEX cores connected by memory buffers. The pipeline was designed to execute CNN inference and to use overlapped execution to increase throughput. Furthermore, because each ρ-VEX core can be configured to operate in a one-core, two-core, or four-core mode depending on the available ILP and TLP, the processor can be adapted to the operation currently being executed and the amount of parallelism it offers.
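The throughput benefit of overlapped execution can be illustrated with a minimal timing model. This is only a sketch under idealized assumptions: every input takes the same number of cycles in a given stage, the buffers between stages never stall the producer, and the stage cycle counts below are hypothetical values, not measurements from this work. The function name `pipeline_timing` is likewise introduced here for illustration.

```python
def pipeline_timing(stage_cycles, n_inputs):
    """Compare sequential and pipelined execution of n_inputs identical jobs.

    stage_cycles: cycles each pipeline stage needs per input (hypothetical).
    Returns (sequential_cycles, pipelined_cycles).
    """
    if not stage_cycles or n_inputs < 1:
        raise ValueError("need at least one stage and one input")

    per_input = sum(stage_cycles)   # latency of one input through all stages
    bottleneck = max(stage_cycles)  # slowest stage limits steady-state rate

    # Without overlap, each input runs through every stage before the next starts.
    sequential = per_input * n_inputs

    # With overlap, the first result appears after per_input cycles and every
    # later result follows one bottleneck interval after the previous one.
    pipelined = per_input + (n_inputs - 1) * bottleneck
    return sequential, pipelined


if __name__ == "__main__":
    seq, pipe = pipeline_timing([4, 2, 3], n_inputs=10)
    print(f"sequential: {seq} cycles, pipelined: {pipe} cycles")
```

In this model the latency of a single input is unchanged by pipelining; only the rate at which results are produced improves, and it is bounded by the slowest stage. Balancing the stages is therefore what determines how much throughput a pipeline configuration can gain.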

By generating the required code from a high-level description of the CNN, it becomes straightforward to test multiple configurations of the pipeline and determine which yields the best performance. The implementation was subsequently tested using a simple network trained on the MNIST dataset.

By dividing the workload of the convolutional layers over multiple contexts to take advantage of data-level parallelism, we improved the latency by 3.03x and the throughput by 3.14x in simulation. By creating a pipeline of six cores in a single-context configuration in the simulator, we achieved a throughput increase of 1.77x. A hardware implementation of the pipeline, consisting of four 2-issue ρ-VEX cores running at a clock frequency of 200 MHz, was also synthesized for a Virtex-6 FPGA. Finally, we propose several optimizations to further increase the performance of CNN inference on the ρ-VEX architecture.
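The idea of dividing a convolutional layer over multiple contexts can be sketched as follows. The abstract does not state how the workload is partitioned, so this example assumes a split over contiguous bands of output rows; the helper names (`conv2d_rows`, `split_rows`, `conv2d_partitioned`) are hypothetical, and the bands are computed one after another here purely to show that they are independent and reassemble into the full result.

```python
def conv2d_rows(image, kernel, row_start, row_end):
    """Valid 2D convolution (cross-correlation form, as is usual in CNNs)
    for output rows in [row_start, row_end)."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    rows = []
    for r in range(row_start, row_end):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        rows.append(row)
    return rows


def split_rows(n_rows, n_contexts):
    """Split n_rows into n_contexts contiguous bands of near-equal size."""
    base, extra = divmod(n_rows, n_contexts)
    bounds, start = [], 0
    for k in range(n_contexts):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def conv2d_partitioned(image, kernel, n_contexts):
    """Compute each band separately (one per assumed context) and concatenate.

    The bands share no output elements, so on real hardware each band could
    be handled by a different context at the same time.
    """
    out_h = len(image) - len(kernel) + 1
    result = []
    for start, end in split_rows(out_h, n_contexts):
        result.extend(conv2d_rows(image, kernel, start, end))
    return result


if __name__ == "__main__":
    image = [[r * 6 + c for c in range(6)] for r in range(6)]
    kernel = [[1, 0], [0, -1]]
    full = conv2d_rows(image, kernel, 0, 5)
    assert conv2d_partitioned(image, kernel, 4) == full
    print(split_rows(5, 4))
```

Because each output element depends only on the input and the kernel, the bands need no synchronization until the layer completes. The speedup from such a split is bounded by the number of contexts and is reduced when the bands are unequal in size, as in the example where five output rows are divided over four contexts.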