Nils Voss
Please Note
8 records found
1
We propose a design methodology to facilitate rigorous development of complex applications targeting reconfigurable hardware. Our methodology relies on analytical estimation of system performance and area utilisation for a given specific application and a particular system instance consisting of a controlflow machine working in conjunction with one or more reconfigurable dataflow accelerators. The targeted application is carefully analyzed, and the parts identified for hardware acceleration are reimplemented as a set of representative software models. Next, with the results of the application analysis, a suitable system architecture is devised and its performance is evaluated to determine bottlenecks, allowing predictable design. The architecture is iteratively refined, until the final version satisfying the specification requirements in terms of performance and required hardware area is obtained. We validate the presented methodology using a widely accepted convolutional neural network (VGG-16) and an important HPC application (BQCD). In both cases, our methodology relieved and alleviated all system bottlenecks before the hardware implementation was started. As a result the architectures were implemented first time right, achieving state-of-the-art performance within 15% of our modelling estimations.
We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate dose calculation in real-time during patient treatment. This will allow wider adoption of personalised patient therapies which has the potential to significantly reduce dose exposure to the patient as well as shorten treatment and greatly reduce costs. The proposed architecture exploits the inherent parallelism of Monte Carlo simulations to perform domain decomposition and provide high resolution simulation without being limited by on-chip memory capacity. We present our architecture in detail and provide a performance model to estimate execution time, hardware area and bandwidth utilisation. Finally, we evaluate our architecture on a Xilinx VU9P platform as well as the Xilinx Alveo U250 and show that three VU9P based cards or two Alevo U250s are sufficient to meet our real time target of 100 million randomly generated particle histories per second.
We propose a novel reconfigurable hardware architecture to implement Monte Carlo based simulation of physical dose accumulation for intensity-modulated adaptive radiotherapy. The long term goal of our effort is to provide accurate online dose calculation in real-time during patient treatment. This will allow wider adoption of personalised patient therapies which has the potential to significantly reduce dose exposure to the patient as well as shorten treatment and greatly reduce costs. The proposed architecture exploits the inherent parallelism of Monte Carlo simulations to perform domain decomposition and provide high resolution simulation without being limited by on-chip memory capacity. We present our architecture in detail and provide a performance model to estimate execution time, hardware area and bandwidth utilisation. Finally, we evaluate our architecture on a Xilinx VU9P platform and show that three cards are sufficient to meet our real time target of 100 million randomly generated particle histories per second.
This paper proposes an algorithm for mapping logical to physical memory resources on FPGAs. Our greedy strategy based algorithm is specifically designed to facilitate timing closure on modern multi-die FPGAs for static-dataflow accelerators utilising most of the on-chip resources. The main objective of the proposed algorithm is to ensure that specific sub-parts of the design under consideration can fully reside within a single die to limit inter-die communication. The above is achieved by performing the memory mapping for each sub-part of the design separately while keeping allocation of the available physical resources balanced. As a result the number of inter-die connections is reduced on average by 50% compared to an algorithm targeting minimal area usage for real, complex applications using most of the on-chip's resources. Additionally, our algorithm is the only one out of the four evaluated approaches which successfully produces place and route results for all 33 applications and benchmarks.
In this paper we propose a technique to minimise the area overhead of a double buffered implementation of Radix-4 Fast Fourier Transformation (FFT). Our proposal circumvents the need for double buffering by exploiting opportunities in the specific data reordering of the buffers that are needed when implementing a fully pipelined FFT. By using the same read and write pattern, a single buffer is sufficient to perform data reordering while maintaining data integrity without degrading performance. We demonstrate this approach in an FPGA implementation. As a result of our optimisation, the memory depth can be reduced by a factor of two with very small overhead in control logic complexity.
In this paper we discuss a high performance implementation for Convolutional Neural Networks (CNNs) inference on the latest generation of Dataflow Engines (DFEs). We discuss the architectural choices made during the design phase taking into account the DFE chip properties. We then perform design space exploration, considering the memory bandwidth and resources utilisation constraints derived from the used DFE and the chosen architecture. Finally, we discuss the high performance implementation and compare the obtained performance against other implementations, showing that our proposed design reaches 2,450 GOPS when running VGG16 as a test case.
In this paper we present several algorithms used to construct a tool that automatically optimizes static dataflow graphs for the purpose of high level hardware synthesis. Our target is to automatically merge multiple dataflow graphs in order to create a single structure implementing all distinct operations with minimal area overhead by time-slicing hardware resources. We show that a combination of dedicated optimizations and a simple greedy approach for graph merging reduces the overall area by up to 4x compared to a naive hardware implementation.
Design productivity is essential for high–performance application development involving accelerators. Low level hardware description languages such as Verilog and VHDL are widely used to design FPGA accelerators, however, they require significant expertise and considerable design efforts. Recent advances in high–level synthesis have brought forward tools that relieve the burden of FPGA application development but the achieved performance results can not approximate designs made using low–level languages. In this paper we compare different FPGA implementations of gzip. All of them implement the same system architecture using different languages. This allows us to compare Verilog, OpenCL and MaxJ design productivity. First, we illustrate several conceptional advantages of the MaxJ language and its platform over OpenCL. Next we show on the example of our gzip implementation how an engineer without previous MaxJ experience can quickly develop and optimize a real, complex application. The gzip design in MaxJ presented here took only one man–month to develop and achieved better performance than the related work created in Verilog and OpenCL.