C. Galuzzi | TU Delft Repository

A Real-Time Reconfigurable Multichip Architecture for Large-Scale Biophysically Accurate Neuron Simulation

Journal article (2018) - Amir Zjajo, Jaco Hofmann, Gerrit Jan Christiaanse, Martijn van Eijk, Georgios Smaragdos, Christos Strydis, Carlo Galuzzi, Rene van Leuken, Alexander de Graaf

Simulating brain neurons in real time with biophysically meaningful models is a prerequisite for a comprehensive understanding of how neurons process information and communicate with each other, and it efficiently complements in-vivo experiments. State-of-the-art neuron simulators are, however, capable of simulating at most a few tens or hundreds of biophysically accurate neurons in real time, because interneuron communication costs grow exponentially with the number of simulated neurons. In this paper, we propose a real-time, reconfigurable, multichip system architecture based on localized communication, which effectively reduces the growth of the communication cost to linear. All parts of the system are generated automatically, based on the neuron connectivity scheme. Experimental results indicate that the proposed system architecture supports a capacity ranging from over 3000 to 19 200 biophysically accurate neurons, depending on the connectivity scheme, over multiple chips. ...

A real-time hybrid neuron network for highly parallel cognitive systems

Conference paper (2016) - G.J. Christiaanse, A. Zjajo, C. Galuzzi, R. van Leuken

For a comprehensive understanding of how neurons communicate with each other, new tools need to be developed that can accurately mimic the behaviour of such neurons and neuron networks under 'real-time' constraints. In this paper, we propose an easily customisable, highly pipelined neuron network design, which executes optimally scheduled floating-point operations for the maximum number of biophysically plausible neurons per FPGA family type. To reduce the required amount of resources without an adverse effect on the calculation latency, a single exponent instance is used for multiple neuron calculation operations. Experimental results indicate that the proposed network design allows the simulation of up to 1188 neurons on a Virtex7 (XC7VX550T) device in brain real-time, yielding a speed-up of x12.4 compared to the state-of-the-art. ...

Multi-Chip Dataflow Architecture for Massive Scale Biophysically Accurate Neuron Simulation

Conference paper (2016) - Jaco Hofmann, Amir Zjajo, Carlo Galuzzi, Rene van Leuken

State-of-the-art neuron simulators are capable of simulating at most a few tens or hundreds of neurons in real time, because communication costs grow exponentially with the number of simulated neurons. In this paper, we present a novel, reconfigurable, multi-chip system architecture based on localized communication, which effectively reduces the growth of the communication cost to linear. The system is very flexible and allows various parameters to be tuned at run-time, e.g. the intracellular concentration of chemical compounds and the interconnection scheme between the neurons. Experimental results indicate that the proposed system architecture allows the simulation of up to a few thousand biophysically accurate neurons over multiple chips. ...

Determining Performance Boundaries on High-Level System Specifications

Conference paper (2016) - Wouter van Teijlingen, Rene van Leuken, Carlo Galuzzi, Bart Kienhuis

We can significantly reduce the time required to realize designs if it is possible to find limits to the performance of an embedded system based solely on high-level system specifications. For that purpose, we present in this paper the cprof profiler, which determines the number of clock cycles needed to execute a C program in hardware. The cprof tool uses the Clang compiler front-end to parse C programs and to produce instrumented source code for the profiling. Using cprof, we determine lower and upper bounds for all 29 cases of the PolyBench/C benchmark suite. The bounds are determined using absolute performance estimations, which assume all statements are mapped onto the same processing resource, and unbounded performance estimations, which assume unlimited resources. We also compared the clock cycles found by cprof with RTL implementations for all 29 PolyBench/C cases and found that cprof determines the correct number of clock cycles with 1.2% accuracy. It does this in a fraction of the time needed for a full RTL simulation. ...

Low Power Programmable Gain Analog to Digital Converter for Integrated Neural Implant Front End

Conference paper (2015) - A Zjajo, C Galuzzi, TGRM van Leuken

Integrated neural implants interface with the brain using biocompatible electrodes to provide high-yield cell recordings, large channel counts and access to spike data and/or field potentials with a high signal-to-noise ratio. By increasing the number of recording electrodes, spatially broad analysis can be performed that can provide insights into how and why neuronal ensembles synchronize their activity. However, the maximum number of channels is constrained by noise, area, bandwidth, power, thermal dissipation and the scalability and expandability of the recording system. In this chapter, we characterize the noise fluctuations on a circuit-architecture level for efficient hardware implementation of a programmable-gain analog-to-digital converter for neural signal processing. This approach provides key insight required to address the signal-to-noise ratio, response time, and linearity of the physical electronic interface. The proposed methodology is evaluated on a prototype converter designed in a standard single-poly, six-metal 90-nm CMOS process. ...

Towards domain-specific Instruction-Set Generation

Conference paper (2014) - Adithya Pulli, Carlo Galuzzi, Georgi Gaydadjiev

Over the past years, a considerable amount of effort has been devoted to the definition and implementation of techniques for the optimization and acceleration of applications on various (reconfigurable) computing platforms. Among these techniques, the extension of a given instruction-set architecture with custom instructions has become a common approach. Custom instructions effectively reduce the dynamic instruction count, which, in turn, leads to an increase in performance. Traditionally, existing techniques address Instruction-Set Extension (ISE) on a per-application basis. However, when many applications have to be considered at the same time, ISE on a per-application basis is clearly less effective, as the custom instructions often have limited re-utilization across applications. To overcome this problem, we propose a new framework for the automatic generation of domain-specific ISEs. Experimental results show that the proposed framework, evaluated on a number of applications from various domains, can effectively generate domain-specific instructions with a high utilization factor across the targeted applications. At the same time, the generated instructions dramatically reduce the instruction count, by 50% on average and by up to 95% in special cases. This, in turn, can lead to a considerable improvement in performance. ...

The Q² profiling framework: driving application mapping for heterogeneous reconfigurable platforms

Conference paper (2012) - SA Ostadzadeh, RJ Meeuws, I Ashraf, C Galuzzi, KLM Bertels

QUAD - a memory access pattern analyser

Conference paper (2010) - SA Ostadzadeh, RJ Meeuws, C Galuzzi, K Bertels

A linear complexity algorithm for the generation of multiple input single output instructions of variable size

Conference paper (2007) - C Galuzzi, K Bertels, S Vassiliadis

High-bandwidth address generation unit

Conference paper (2007) - H Calderon, C Galuzzi, GN Gaydadjiev, S Vassiliadis

A linear complexity algorithm for the automatic generation of convex multiple input multiple output instructions

Conference paper (2007) - C Galuzzi, K Bertels, S Vassiliadis