T.G.R.M. van Leuken | TU Delft Repository

Jumping Shift

A Logarithmic Quantization Method for Low-Power CNN Acceleration

Conference paper (2023) - Longxing Jiang, David Aledo , Rene van Leuken

Logarithmic quantization for Convolutional Neural Networks (CNN): a) fits typical weight and activation distributions well, and b) allows the multiplication operation to be replaced by a shift operation that can be implemented with fewer hardware resources. We propose a new quantization method named Jumping Log Quantization (JLQ). The key idea of JLQ is to extend the quantization range by adding a coefficient parameter “s” in the power-of-two exponents $(2^{sx+i})$. This quantization strategy skips some values from the standard logarithmic quantization. In addition, we also develop a small hardware-friendly optimization called weight de-zero. Zero-valued weights, which cannot be implemented by a single shift operation, are all replaced with logarithmic weights to reduce hardware resources with almost no accuracy loss. To implement the Multiply-And-Accumulate (MAC) operation (needed to compute convolutions) when the weights are JLQ-ed and de-zeroed, a new Processing Element (PE) has been developed. This new PE uses a modified barrel shifter that can efficiently avoid the skipped values. Resource utilization, area, and power consumption of the new PE standing alone are reported. We have found that JLQ performs better than other state-of-the-art logarithmic quantization methods when the bit width of the operands becomes very small. ...
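The idea of restricting weights to powers of two with exponents of the form sx+i can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of x = 0, 1, 2, ..., the nearest-exponent rounding rule, and the mapping of zero weights to the smallest level are all assumptions made for illustration.

```python
import math


def jlq_exponents(s: int, i: int, n_levels: int) -> list[int]:
    """Allowed exponents s*x + i for x = 0 .. n_levels-1 (assumed parameterization)."""
    return [s * x + i for x in range(n_levels)]


def quantize_weight(w: float, exponents: list[int]) -> tuple[int, int]:
    """Map a weight to (sign, exponent), choosing the closest allowed exponent in the log domain."""
    if w == 0:
        # Assumption: a zero weight is replaced by the smallest available magnitude.
        return 1, min(exponents)
    sign = -1 if w < 0 else 1
    log_w = math.log2(abs(w))
    exponent = min(exponents, key=lambda e: abs(e - log_w))
    return sign, exponent


def shift_multiply(activation: int, sign: int, exponent: int) -> int:
    """Multiply an integer activation by sign * 2**exponent using only shifts."""
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent  # arithmetic right shift
    return sign * shifted


# With s = 2 and i = -7, four levels cover 2**-7 .. 2**-1 and skip every other exponent.
levels = jlq_exponents(s=2, i=-7, n_levels=4)
sign, exponent = quantize_weight(0.1, levels)
product = shift_multiply(64, sign, exponent)
```

With s = 1 the same four levels would only span four consecutive exponents, so a larger s trades resolution for range at the same bit width.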

Near-Precise Parameter Approximation for Multiple Multiplications on A Single DSP Block

Journal article (2022) - Ercan Kalali, Rene Van Leuken

DSP blocks are one of the efficient solutions to implement multiply-accumulate (MAC) operations on FPGAs. However, since the DSP blocks have wide multiplier and adder blocks, MAC operations using low bit-length parameters lead to underutilization. Hence, an efficient approximation technique is introduced. The technique includes manipulation and approximation of the low bit-length parameters based upon a Single DSP - Multiple Multiplication (SDMM) execution. The accuracy of the developed optimization technique was evaluated for different CNN weight bit precisions using the AlexNet and VGG-16 networks and the ImageNet ILSVRC-2012 dataset. The optimization can be implemented without loss of accuracy in almost all cases, while it causes slight accuracy losses in a few cases. Through these optimizations, multiple parameter multiplications are performed in a single DSP block at the cost of a small hardware overhead. As a result of our optimizations, the parameters are represented in a different format on off-chip memory, providing up to 33% compression without any hardware cost. A prototype systolic array architecture was implemented employing our optimizations on a Xilinx Zynq FPGA. It reduced the number of DSP blocks by 66.6%, 75%, and 83.3% for 8, 6, and 4-bit input variables, respectively. ...
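The general idea of sharing one wide multiplier between several narrow multiplications can be sketched as follows. This shows only generic operand packing for unsigned values; the parameter manipulation and approximation described in the abstract are not reproduced here, and the function name and bit widths are hypothetical.

```python
def packed_multiply(a: int, b: int, c: int, bits: int = 8) -> tuple[int, int]:
    """Compute a*c and b*c with a single wide multiplication (unsigned operands only)."""
    limit = 1 << bits
    if not (0 <= a < limit and 0 <= b < limit and 0 <= c < limit):
        raise ValueError("operands must be unsigned and fit in the given bit width")
    spacing = 2 * bits                  # b*c needs at most 2*bits bits, so the fields cannot overlap
    packed = (a << spacing) | b         # two narrow parameters in one wide operand
    product = packed * c                # the single wide multiplication
    low = product & ((1 << spacing) - 1)  # lower field holds b*c
    high = product >> spacing             # upper field holds a*c
    return high, low


ac, bc = packed_multiply(5, 7, 3)
```

The spacing between the packed fields determines how wide the multiplier must be, which is why narrower parameters allow more products per multiplier.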

Temporal synchronization of radar and lidar streams

Conference paper (2022) - D. Aledo Ortega, T. Manjunath, R.T. Rajan, Darek Maksimiuk, T.G.R.M. van Leuken

In multi-sensor systems, several sensors produce data streams, commonly at different frequencies. If they are left running without synchronization, after a period of time they are likely to become disordered, presenting measurements that were recorded at different times as simultaneous. That can be disastrous in many data fusion applications. This paper is about their temporal synchronization and ordering, so that they can be coherently fused. Some sensors do not provide timestamps from which to order the streams, and even when they do, the timestamps may not be trustworthy for various reasons. First, we define mathematically the problem of multi-sensor data stream synchronization. Then, we handle the problem of estimating the actual time of sensor measurement using mean or median filters. Next, we address the issue of reconstructing incoming sensor data streams according to the estimated sensor measurement times while maintaining minimal latency and synchronization error by employing an adaptive stream buffering technique used in distributed multimedia systems. In order to test our methods, we have recorded an easy-to-use dataset with a radar and a lidar sensor without timestamps. We define a synchronization event that is easily identifiable by a human annotator in both sensor streams. From this dataset, a suitable filter for timestamp estimation is selected, and an analysis of the effects of the stream synchronization algorithm’s parameters on buffering latency and synchronization error is presented. Finally, the solution is efficiently implemented on an FPGA ...
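Estimating measurement times with a median filter can be sketched as below. The abstract does not say what quantity the filter is applied to, so this sketch assumes a nominally periodic sensor and filters the inter-arrival intervals; the function name, the window size, and the use of the first arrival as the time origin are all assumptions.

```python
from statistics import median


def estimate_measurement_times(arrivals: list[float], window: int = 5) -> list[float]:
    """Estimate measurement times from jittery arrival times of a periodic sensor."""
    if not arrivals:
        return []
    estimates = [arrivals[0]]  # assumption: the first arrival defines the time origin
    intervals = []
    for previous, current in zip(arrivals, arrivals[1:]):
        intervals.append(current - previous)
        # A sliding median over recent intervals rejects isolated delivery delays.
        period = median(intervals[-window:])
        estimates.append(estimates[-1] + period)
    return estimates


# Arrival times in milliseconds; the fourth sample is delivered 50 ms late.
estimated = estimate_measurement_times([0, 100, 200, 350, 400, 500])
```

Replacing `median` with `statistics.mean` gives the mean-filter variant, which is more sensitive to isolated outliers such as the late sample in the example.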

A Power-Efficient Parameter Quantization Technique for CNN Accelerators

Conference paper (2021) - Ercan Kalali, Rene van Leuken

Quantization techniques are widely used in CNN inference to reduce the cost of hardware at the expense of small accuracy losses. However, after the quantization, there is still a multiplication cost for the fixed-point quantized CNN weights. Therefore, a novel CNN quantization technique is introduced, which can be implemented without using any multiplier. We evaluated our quantization technique using the VGG-16 and AlexNet networks and the Tiny ImageNet dataset. The quantization technique causes 0.39% and 0.98% accuracy losses for the 8-bit CNN weights compared to floating-point implementations of VGG-16 and AlexNet, respectively. Afterwards, a fine-tuning method for our quantization is introduced, which further reduces the accuracy loss. The fine-tuning reduced the accuracy losses on 8-bit quantized VGG-16 and AlexNet to 0.24% and 0.39%, respectively. Two different processing element architectures, which do not include any multiplier hardware, are designed to perform multiply-accumulate (MAC) operations of CNN models quantized by our technique. Two different systolic array prototypes are designed employing the two PE architectures to compare with the traditional fixed-point MAC implementation. The systolic array architectures containing our processing element designs reduced the power consumption of the systolic array by up to 14.2% and 21.6%. ...

Towards robust inference of biomedical signals in energy-efficient neuromorphic networks

Conference paper (2019) - Amir Zjajo, Sumeet Kumar, Rene Van Leuken

Computation capability characteristics of neuromorphic analog/mixed-signal spiking neural networks offer a capable platform for implementation of cognitive tasks on resource-limited embedded platforms. In this paper, we derive a stochastic model of spiking neural processing systems for energy-efficient recognition and inference of biomedical systems. We examine imperfections in the network dynamics and noise-induced information processing, the influence of the uncertainty on the behavior of the emulated networks, and the impact on the clustering accuracy of cardiac arrhythmia. Experimental results indicate that stochasticity at network connections is an adequate resource for deep learning machines. ...

Multi-Layer Neuromorphic Synapse for Reconfigurable Networks

Conference paper (2019) - Amir Zjajo, Sumeet Kumar, Rene Van Leuken

In pulse-based neural networks, synaptic dynamics can have a direct influence on the learning of neural codes and the encoding of spatiotemporal spike patterns. In this paper, we propose an adaptive synapse circuit for increased flexibility and efficacy of signal processing units in neuromorphic structures. The synapse acts as a multi-layer computational network, and includes multi-compartment dendrites and different types of post-synaptic back-propagating signals. With built-in temporal control mechanisms, the resulting reconfigurable network allows the implementation of synaptic homeostatics. ...

Towards Computationally-Efficient Cognitive Sensor Systems for Autonomous Vehicles

Conference paper (2019) - Shashanka Marigi Rajanarayana, Sumeet Kumar, Amir Zjajo, Rene Van Leuken

Advanced driving assistance systems (ADAS) prepare regulators, consumers and corporations for the medium-term reality of autonomous driving with adaptive cruise control, collision avoidance and lane departure warning systems. Various sensors, such as camera, RADAR and LIDAR, are integrated into the vehicle to assist driving. In addition, deep learning approaches are utilized in a wide range of applications, ranging from object detection and scene segmentation to engine fault diagnosis, emission management and vehicle network intrusion detection. In this paper, we scope out the state-of-the-art sensor subsystems in terms of their functionality, characteristics, specifications and communication protocols, and we describe the cognitive deep-learning-based algorithms required for environment perception through these sensors. Subsequently, we analyze the cognitive algorithms by profiling the standard deep learning models, and explore different compute platforms and possible algorithm and hardware optimization scenarios. ...

Heterogeneous Activation Function Extraction for Training and Optimization of SNN Systems

Conference paper (2019) - Amir Zjajo, Sumeet Kumar, Rene Van Leuken

Energy-efficiency and computation capability characteristics of analog/mixed-signal spiking neural networks offer a capable platform for implementation of cognitive tasks on resource-limited embedded platforms. However, inherent mismatch in analog devices severely influences the accuracy and reliability of the computing system. In this paper, we devise an efficient algorithm for extracting heterogeneous activation functions of analog hardware neurons as a set of constraints in an off-line training and optimization process, and examine how compensation of the mismatch effects influences the synchronicity and information processing capabilities of the system. ...

A sub-nW neuromorphic receptors for wide-range temporal patterns of post-synaptic responses in 65 nm CMOS

Journal article (2019) - Xuefei You, Amir Zjajo, Sumeet Kumar, Rene van Leuken

Synaptic dynamics is of great importance in realizing biophysically accurate neural behaviors and efficient synaptic learning in neuromorphic integrated circuits. In this paper, we propose a current-based synapse structure with multi-compartment receptors AMPA, NMDA and GABAa and a weight-dependent learning algorithm. The designed circuit offers distinctive dynamic features of receptors as well as a joint synaptic function. A cross-correlation methodology is applied to a two-layer RNN built from multi-compartment receptors to demonstrate the proposed synapse structure. An increased computation efficiency is verified through temporal synchrony detection among the neural layers in a noisy environment. The design, implemented in TSMC 65 nm CMOS technology, consumes 1.92, 3.36, 1.11 and 35.22 pJ of energy per spike event for AMPA, NMDA, GABAa and the advanced learning circuit, respectively. ...

Energy-Efficient Multipath Ring Network for Heterogeneous Clustered Neuronal Arrays

Conference paper (2018) - Andrei Ardelean, Amir Zjajo, Sumeet Kumar, Rene van Leuken

Simulating large spiking neural networks with a high level of realism in an FPGA requires efficient network architectures that satisfy both the resource and interconnect constraints, as well as the changes in traffic patterns due to learning processes. In this paper, we propose a dataflow architecture based on a multipath ring topology that offers traffic shaping capabilities and high energy-efficiency for the neuron-to-neuron communications. ...

Uncertainty in Noise-Driven Steady-State Neuromorphic Network for ECG Data Classification

Conference paper (2018) - Amir Zjajo, Johan Mes, Sumeet Kumar, Eralp Kolagasioglu, Rene van Leuken

The pathophysiological processes underlying the ECG tracing demonstrate significant variations in heart rate and morphological pattern, across different patients or in the same patient under diverse physical/temporal conditions. Within this framework, spiking neural networks (SNN) may be a compelling approach to ECG pattern classification based on the individual characteristics of each patient. In this paper, we study electrophysiological dynamics in the self-organizing map SNN when the coefficients of the neuronal connectivity matrix are random variables. We examine synchronicity and noise-induced information processing, the influence of the uncertainty on the system signal-to-noise ratio, and the impact on the clustering accuracy of cardiac arrhythmia. ...

A Real-Time Reconfigurable Multichip Architecture for Large-Scale Biophysically Accurate Neuron Simulation

Journal article (2018) - Amir Zjajo, Jaco Hofmann, Gerrit Jan Christiaanse, Martijn van Eijk, Georgios Smaragdos, Christos Strydis, Carlo Galuzzi, Rene van Leuken, Alexander de Graaf

Simulation of brain neurons in real-time using biophysically meaningful models is a prerequisite for comprehensive understanding of how neurons process information and communicate with each other, in effect efficiently complementing in-vivo experiments. State-of-the-art neuron simulators are, however, capable of simulating at most a few tens/hundreds of biophysically accurate neurons in real-time due to the exponential growth in the interneuron communication costs with the number of simulated neurons. In this paper, we propose a real-time, reconfigurable, multichip system architecture based on localized communication, which effectively reduces the communication cost to a linear growth. All parts of the system are generated automatically, based on the neuron connectivity scheme. Experimental results indicate that the proposed system architecture allows a capacity of over 3000 to 19 200 (depending on the connectivity scheme) biophysically accurate neurons over multiple chips. ...

Neuromorphic Spike Data Classifier for Reconfigurable Brain-Machine Interface

Conference paper (2017) - Amir Zjajo, Sumeet Kumar, Rene van Leuken

In this paper, we propose a reconfigurable neural spike classifier based on neuromorphic event-based networks that can be directly interfaced to neural signal conditioning and quantization circuits. The classifier is set as a heterogeneity-based, multi-layer computational network to offer wide flexibility in the implementation of plastic and metaplastic interactions, and to increase efficacy in neural signal processing. Built-in temporal control mechanisms allow the implementation of homeostatic regulation in the resulting network. The results obtained in a 90 nm CMOS technology show that efficient neural spike data classification can be obtained with a low-power (9.4 μW/core) and compact (0.54 mm² per core) structure. ...

A Wideband Linear Direct Digital RF Modulator using Harmonic Rejection and I/Q-Interleaving RF DACs

Conference paper (2017) - M. Mehrpoo, M. Hashemi, Y. Shen, R. van Leuken, M.S. Alavi, L.C.N. de Vreede

This paper presents a wideband linear direct digital RF modulator (DDRM) in 40 nm CMOS technology. It features an advanced 2nd-order-hold interpolation filter and I/Q-interleaving harmonic rejection RF DACs. The 2×9-bit DDRM core occupies 0.21 mm² and consumes only 110 mW at 1 GHz. Within the 0.9-3.1 GHz frequency range, the peak output power reaches +9.2 dBm and the 3rd/5th harmonic rejection, C-IMD3, and OIP3 are respectively better than 30 dB, -44 dBc, and +25 dBm. The EVM and ACPR at 3 GHz for a 57-MHz 64-QAM signal are better than -30 dB and -45 dB, respectively, and ACPR remains as low as -44 dBc up to a wide bandwidth of 110 MHz. ...

Immediate Neighbourhood Temperature Adaptive Routing for Dynamically Throttled 3-D Networks-on-Chip

Journal article (2017) - Sumeet S. Kumar, Amir Zjajo, Rene van Leuken

In this paper, we present the Immediate Neighbourhood Temperature (INT) routing algorithm, which balances thermal profiles across dynamically-throttled 3D NoCs by adaptively routing interconnect traffic based on runtime temperature monitoring. INT avoids the overheads of system-wide temperature monitoring by relying on the heat transfer characteristics of 3D integrated circuits, which enable temperature information from routers in the immediate neighbourhood to guide adaptive routing decisions. Experimental results indicate that INT yields balanced thermal profiles with up to 25% lower gradients than competing schemes, and shortens communication latencies by decreasing average network congestion by up to 50%, with negligible overheads. ...

Energy-Efficient Neuromorphic Receptors for Wide-Range Temporal Patterns of Post-Synaptic Responses

Conference paper (2017) - Xuefei You, Amir Zjajo, Sumeet S. Kumar, Rene van Leuken

In a neuromorphic integrated circuit, synaptic dynamics are of great importance to capture accurate neural behaviors. In this paper, we propose a current-based synapse design mediated with multiple receptor types, namely AMPA, NMDA and GABAa, and a weight-dependent learning algorithm. Due to various biological conducting mechanisms, the receptors demonstrate different kinetics in response to stimulus. The designed circuit offers distinctive features of receptors as well as the joint synaptic function. An increased computation ability is verified through synchrony detection in a two-layer recurrent network of synapse clusters. The design, implemented in TSMC 65 nm CMOS technology, consumes 1.92, 3.36, 1.11 and 35.22 pJ of energy per spike event for AMPA, NMDA, GABAa receptors and the advanced learning circuit, respectively. ...

Fighting Dark Silicon

Toward Realizing Efficient Thermal-Aware 3-D Stacked Multiprocessors

Journal article (2017) - Sumeet S. Kumar, Amir Zjajo, Rene van Leuken

This paper investigates the challenges of dark silicon that impede the performance and reliability of 3-D stacked multiprocessors. It presents a multipronged approach toward addressing the thermal issues arising from high-density integration in die stacks, spanning architectural techniques, design methodologies, and runtime temperature management. Importantly, this paper provides novel insights into the causes of hotspot formation in 3-D ICs and details a practical approach toward exploring and mitigating performance-limiting thermal behavior early in the system design flow. ...

A Fully-Integrated Digital-Intensive Polar Doherty Transmitter

Conference paper (2017) - Yiyu Shen, Mohammadreza Mehrpoo, Mohsen Hashemi, Michael Polushkin, Lei Zhou, Mustafa Acar, Rene van Leuken, Morteza S. Alavi, Leo de Vreede

This paper presents an advanced 2.3-2.8 GHz fully-integrated digital-intensive polar Doherty transmitter realized in 40nm standard CMOS. The proposed architecture comprises CORDIC, digital delay aligners, interpolators, digital pre-distortion (DPD) circuitry in combination with frequency-agile wideband phase modulators followed by the digital main and peak power amplifier (PA) operating in quasi-load insensitive class-E using an on-chip power combiner. At 2.5 GHz, its maximum output power is +21.4 dBm. Drain efficiency is 49.4% at peak power, and 33.7% at 6-dB power back-off. Applying DPD for a 20-MHz 64-QAM signal, the measured EVM is better than -30 dB while the average drain efficiency is 24%. ...

Digital Spiking Neuron Cells for Real-Time Reconfigurable Learning Networks

Conference paper (2017) - Haipeng Lin, Amir Zjajo, Rene van Leuken

The high level of realism of spiking neuron networks and their complexity require substantial computational resources, limiting the size of the realized networks. Consequently, the main challenge in building complex and biologically-accurate spiking neuron networks is largely set by the high computational and data transfer demands. In this paper, we implement several efficient models of the spiking neurons with characteristics such as axon conduction delays and spike timing-dependent plasticity. Experimental results indicate that the proposed real-time data-flow learning network architecture allows a capacity of over 2800 (depending on the model complexity) biophysically accurate neurons in a single FPGA device. ...

An Intrinsically Linear Wideband Digital Polar PA Featuring AM-AM and AM-PM Corrections Through Nonlinear Sizing, Overdrive-Voltage Control, and Multiphase RF Clocking

Conference paper (2017) - Mohsen Hashemi, Yiyu Shen, Mohammadreza Mehrpoo, Mustafa Acar, René van Leuken, Morteza S. Alavi, Leonardus de Vreede

To fully benefit from the progress of CMOS technologies, it is desirable to completely digitize the TX, replacing its final stage with a digitally controlled PA (DPA). The DPA consists of arrays of small sub-PAs that are digitally controlled to modulate the output amplitude, thus operating as an RF-DAC [1-6]. DPAs are normally designed in a switched mode (Classes E/D/D-1, etc.) to achieve high efficiency while using a high sampling rate to attenuate and push the spectral images to higher frequencies. However, they suffer from high nonlinearity in their AM-code-word (ACW) to AM and ACW-to-PM conversion. To correct for such nonlinearities, digital pre-distortion (DPD) of the input signal is often used [1-3], typically implemented by look-up tables (LUT). Unfortunately, DPD approaches suffer from large signal-BW expansion due to their inherently nonlinear characteristics. This, combined with the already present BW regrowth in a polar TX in the AM and PM paths, yields significant hardware-speed/power constraints when the signal BW becomes large. For a Cartesian TX, the use of LUT-DPD is even more complicated since a full 2D LUT is typically required [2]. To relax the overall system complexity, it is highly desirable to have a PA with maximum inherent linearity without compromising its power or efficiency. In this work, an ACW-AM correction based on nonlinear sizing along with controlling the peak voltage of RF clocks (overdrive voltage tuning) and an ACW-PM correction based on multiphase RF clocking are introduced to linearize the characteristic curves of a Class-E polar DPA with the intent of avoiding any kind of pre-distortion. ...