A. Singh

17 records found

Conference paper (2026) - Y. Biyani, A. Singh, R. Bishnoi, S. Hamdioui
Analog Compute-in-Memory (CIM), leveraging non-volatile memristive devices to perform in-place computations in the analog domain, holds great potential to efficiently accelerate vector-matrix multiplications (VMM) and realize AI (Artificial Intelligence) at the edge. However, the data converters in such architectures often trade off accuracy for high energy and area overheads, practically limiting the benefits of CIM. In this work, we present SABCIM, an array-periphery co-design approach for CIM that enables accurate computation as well as digitization of analog VMM outputs with high energy efficiency and competitive area overhead. By leveraging complementary input activations and data storage, each crossbar column generates differential analog output corresponding to the vector-vector multiplication (VVM) result, while inherently addressing underlying non-idealities. This is digitized using a compact, dual-ramp voltage-to-time converter (VTC)-based analog-to-digital converter (ADC). Benchmark results indicate that our work achieves up to 19.6× higher energy efficiency compared to state-of-the-art (SOTA), while maintaining comparable accuracies. ...
Recent advances in Resistive RAM (RRAM) based Computation-In-Memory (CIM) architectures highlight significant potential for accelerating data-intensive computing tasks. However, non-idealities in RRAM devices, such as variability, result in small sensing margins that can significantly affect the computational efficiency. This issue becomes even more pronounced when dealing with complex multi-operand logic operations. This paper introduces a circuit-level scheme for CIM-based multi-operand XOR logic operations, leveraging a Voltage-To-Time converter (VTC) to perform multi-phased XORs in a single clock cycle. In this approach, we exploit bitline capacitances for voltage-based sensing during computation, generating an output voltage that is linearly proportional to the operand values. This voltage is then converted into the desired logic output using the VTC design. Furthermore, low-power techniques are employed in the deployment of sense amplifiers, such as regulating power consumption during operation and disabling the amplifiers once the decision is made. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that up to 16-operand XOR operation can be accurately and reliably performed as opposed to a maximum of three operands supported by state-of-the-art solutions, while offering up to 49× better figure-of-merit combining energy-efficiency and throughput. ...
Doctoral thesis (2024) - A. Singh, S. Hamdioui, R.V. Joshi, R.K. Bishnoi
Conventional computing systems involve physically separated storing and processing units. To perform the processing, data is shuttled from the storing unit to the processing unit followed by the actual processing, and the processed data is shuttled back into the storing unit. Unfortunately, this data shuffling contributes significantly to the overall latency and energy consumption of the system. Computation-in-memory (CIM) offers a promising alternative that can process the data within the storing unit, thereby, alleviating the need for data shuffling. This can potentially lead to high energy efficiency and high throughput computation. In addition, emerging non-volatile memristors provide an excellent storing element that can inherently perform the computation in CIM while also retaining the data value.
Memristor-based CIM has been vastly explored to perform certain data-intensive computational tasks for numerous applications related to artificial intelligence (AI), Big Data, and data encryption while realizing high-performing, low-power micro-architectural solutions. This thesis first identifies the existing key challenges related to emerging memory technology and CIM memory array, and the circuit design of periphery logic to achieve low-power and high-performing CIM units. Thereafter, it presents several micro-architectural solutions to build CIM accelerators that can perform logic and arithmetic operations in an efficient manner.

Identifying the challenges of memristor-based CIM: The thesis briefly summarizes the key aspects of a memristor-based CIM architecture that can perform certain logic and arithmetic operations. First, the key components of a typical CIM architecture are presented, highlighting the design considerations to build these components. An account of computational accuracy and efficiency share of each of these components is presented supported by a comprehensive literature survey while highlighting the underlying concepts of each component. This is followed by identifying the key challenges in achieving stringent efficiency metrics required by the targeted applications while dealing with non-idealities. Finally, the limitations of the state-of-the-art solutions that target these challenges are highlighted, thus, providing the motivation to develop outperforming solutions that are presented in the thesis.

Developing micro-architectural solutions for efficient CIM-based logic accelerators: This part of the thesis presents several logic accelerators that can perform (N)OR, (N)AND, and XOR operations in a fast and energy-efficient manner. Five different solutions target different aspects of performing CIM-based logic: scaling the number of operands per cycle and the memory size, and expanding the types of operations that can be supported, while addressing the non-idealities. State-of-the-art solutions are comprehensively outperformed by our proposed solutions in terms of energy efficiency, performance, and the maximum number of operands that can be accurately performed in a single cycle.

Developing micro-architectural solutions for efficient CIM-based arithmetic accelerators: This part of the thesis presents several compact arithmetic accelerators that can perform multiply-and-accumulate (MAC) operations with low energy consumption. Three different analog-to-digital converter (ADC) topologies are presented, two of them demonstrated with a chip prototype. Optimizing the memory structure is also explored to ensure accurate generation of analog output in the CIM memory array and to optimize the A/D conversion. A comparison of the proposed solutions with the state-of-the-art is presented, highlighting the promise of the developed micro-architectures. ...
Journal article (2023) - Anteneh Gebregiorgis, Abhairaj Singh, Amirreza Yousefzadeh, Dirk Wouters, Rajendra Bishnoi, Francky Catthoor, Said Hamdioui
Smart computing on edge-devices has demonstrated huge potential for various application sectors such as personalized healthcare and smart robotics. These devices aim at bringing smart computing close to the source where the data is generated or stored, while coping with the stringent resource budget of the edge platforms. The conventional Von-Neumann architecture fails to meet these requirements due to various limitations, e.g., the memory-processor data transfer bottleneck. Memristor-based Computation-In-Memory (CIM) has the potential to realize such smart edge computing for data-dominated Artificial Intelligence (AI) applications by exploiting both the inherent properties of the architecture and the physical characteristics of the memristors. This paper discusses different aspects of CIM, including classification, working principle, CIM potentials and CIM design-flow. The design-flow is illustrated through two case studies to demonstrate the huge potential of CIM in realizing orders of magnitude improvement in energy-efficiency as compared to the conventional architectures. Finally, future challenges and research directions of CIM are covered. ...
Analog computation-in-memory (CIM) architecture alleviates massive data movement between the memory and the processor, thus promising great prospects to accelerate certain computational tasks in an energy-efficient manner. However, data converters involved in these architectures typically achieve the required computing accuracy at the expense of high area and energy footprint which can potentially determine CIM candidacy for low-power and compact edge-AI devices. In this work, we present a memory-periphery co-design to perform accurate A/D conversions of analog matrix-vector-multiplication (MVM) outputs. Here, we introduce a scheme where select-lines and bit-lines in the memory are virtually fixed to improve conversion accuracy and aid a ring-oscillator-based A/D conversion, equipped with component sharing and inter-matching of the reference blocks. In addition, we deploy a self-timed technique to further ensure high robustness addressing global design and cycle-to-cycle variations. Based on measurement results of a 4Kb CIM chip prototype equipped with TSMC 40nm, a relative accuracy of up to 99.71% is achieved with an energy efficiency of 115.1 TOPS/W and computational density of 12.1 TOPS/mm2 for the MNIST dataset. This corresponds to an improvement of up to 11.3X and 7.5X, respectively, compared to the state-of-the-art. ...
Journal article (2023) - Sumit Diware, Abhairaj Singh, Anteneh Gebregiorgis, Rajiv V. Joshi, Said Hamdioui, Rajendra Bishnoi
Computation-in-memory (CIM) paradigm leverages emerging memory technologies such as resistive random access memories (RRAMs) to process the data within the memory itself. This alleviates the memory-processor bottleneck resulting in much higher hardware efficiency compared to von-Neumann architecture-based conventional hardware. Hence, CIM becomes an attractive alternative for applications like neural networks which require a huge number of data transfer operations in conventional hardware. CIM-based neural networks typically employ bit-slicing scheme which represents a single neural weight using multiple RRAM devices (called slices) to meet the high bit-precision demand. However, such neural networks suffer from significant accuracy degradation due to non-zero Gmin error where a zero weight in the neural network is represented by an RRAM device with a non-zero conductance. This paper proposes an unbalanced bit-slicing scheme to mitigate the impact of non-zero Gmin error. It achieves this by allocating appropriate sensing margins for different slices based on their binary positions. It also tunes the sensing margins to meet the demands of either high accuracy or energy-efficiency. The sensing margin allocation is supported by 2's complement arithmetic which further reduces the influence of non-zero Gmin error. Simulation results show that our proposed scheme achieves up to 7.3× accuracy and up to 7.8× correct operations per unit energy consumption compared to state-of-the-art. ...
Computation-In-Memory (CIM) using memristor devices provides an energy-efficient hardware implementation of arithmetic and logic operations for numerous applications, such as neuromorphic computing and database query. However, memristor-based CIM suffers from various non-idealities such as conductance drift, read disturb, wire parasitics, endurance and device degradation. These negatively impact the computation accuracy of CIM. It is therefore essential to deal with these non-idealities and fabrication imperfections in order to harness the full potential of CIM. This paper discusses the non-ideality challenges and provides potential solutions. Furthermore, the paper outlines the potential future directions for CIM architectures. ...
Journal article (2022) - Riduan Khaddam-Aljameh, Milos Stanisavljevic, Jordi Fornt Mas, Geethan Karunaratne, Matthias Brandli, Feng Liu, Abhairaj Singh, Silvia M. Muller, Urs Egger, More authors...
We present a 256 × 256 in-memory compute (IMC) core designed and fabricated in 14-nm CMOS technology with backend-integrated multi-level phase change memory (PCM). It comprises 256 linearized current-controlled oscillator (CCO)-based A/D converters (ADCs) at a compact 4-μm pitch and a local digital processing unit (LDPU) performing affine scaling and ReLU operations. A frequency-linearization technique for CCO is introduced, which increases the maximum CCO frequency beyond 3 GHz, while ensuring accurate on-chip matrix-vector multiplications (MVMs). Moreover, the design and functionality of the digital ADC calibration procedure is described in detail and the MVM accuracy is quantified. Finally, the measured classification accuracies of deep learning (DL) inference applications on the MNIST and CIFAR-10 datasets, when two IMC cores are employed, are presented. For a performance density of 1.59 TOPS/mm2, a measured energy efficiency of 10.5 TOPS/W, at a main clock frequency of 1 GHz, is achieved. ...
Conference paper (2022) - Abhairaj Singh, Mahdi Zahedi, Taha Shahroodi, Mohit Gupta, Anteneh Gebregiorgis, Manu Komalan, Rajiv V. Joshi, Francky Catthoor, Rajendra Bishnoi, Said Hamdioui
Spin-transfer torque magnetic random access memory (STT-MRAM) based computation-in-memory (CIM) architectures have shown great prospects for energy-efficient computing. However, device variations and non-idealities narrow down the sensing margin, which severely impacts the computing accuracy. In this work, we propose an adaptive referencing mechanism to improve the sensing margin of a CIM architecture for boolean binary logic (BBL) operations. We generate reference signals using multiple STT-MRAM devices and place them strategically into the array such that these signals can address the variations and trace the wire parasitics effectively. We have demonstrated this behavior using an STT-MRAM model, which is calibrated using a 1Mbit characterized array. Results show that our proposed architecture for binary neural networks (BNN) achieves up to 17.8 TOPS/W on the MNIST dataset and 130× performance improvement for the text encryption compared to the software implementation on an Intel Haswell processor. ...
Emerging non-volatile resistive RAM (RRAM) device technology has shown great potential to cultivate not only high-density memory storage, but also energy-efficient computing units. However, the unique challenges related to RRAM fabrication process render the traditional memory testing solutions inefficient and inadequate for high product quality. This paper presents low-cost design-for-testability (DFT) solutions that augment the testing process and improve the fault coverage. A computation-in-memory (CIM) based DFT is realized to expedite the detection and diagnosis of faults by developing logic designs involving multi-row activation. A novel addressing scheme is introduced to facilitate the diagnosis of faults. Reconfigurable logic designs are developed to detect unique RRAM faults that offer features such as programmable reference generations, period, and voltage of operation. DFT implementations are validated on a post-layout extracted platform and testing sequences are introduced by incorporating the proposed DFTs. Results show that more than 2.3× speedup and better coverage are achieved with 6× area reduction when compared with state-of-the-art solutions. ...
Conference paper (2022) - Abhairaj Singh, Rajendra Bishnoi, Rajiv V. Joshi, Said Hamdioui
Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures are attracting a lot of attention due to their potential in performing fast and energy-efficient computing. However, the RRAM variability and non-idealities limit the computing accuracy of such architectures, especially for multi-operand logic operations. This paper proposes a voltage-based differential referencing-in-array scheme that enables accurate two and multi-operand logic operations for RRAM-based CIM architecture. The scheme makes use of a 2T2R cell configuration to create a complementary bitcell structure that inherently acts also as a reference during the operation execution; this results in a high sensing margin. Moreover, the variation-sensitive multi-operand (N)AND operation is implemented using complementary-input (N)OR operation to further improve its accuracy. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that up to 56 operand (N)OR/(N)AND operation can be accurately and reliably performed as opposed to a maximum of 4 operands supported by state-of-the-art solutions, while offering up to 11.4X better energy-efficiency. ...
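
The abstract above notes that a multi-operand (N)AND can be obtained from a complementary-input (N)OR. The logical identity behind this is De Morgan's law, and the short Python sketch below checks it exhaustively for a few operand counts. It is only a software illustration of the Boolean equivalence; the function names are hypothetical and nothing here models the 2T2R circuit or its sensing scheme.

```python
from itertools import product

def nor(bits):
    """Multi-operand NOR: 1 only when every operand is 0."""
    return int(not any(bits))

def or_(bits):
    """Multi-operand OR: 1 when at least one operand is 1."""
    return int(any(bits))

def and_via_nor(bits):
    """AND computed as NOR over the complemented operands (De Morgan)."""
    return nor([1 - b for b in bits])

def nand_via_or(bits):
    """NAND computed as OR over the complemented operands (De Morgan)."""
    return or_([1 - b for b in bits])

def identity_holds(num_operands):
    """Exhaustively compare against the direct AND/NAND definitions."""
    for bits in product((0, 1), repeat=num_operands):
        direct_and = int(all(bits))
        if and_via_nor(bits) != direct_and:
            return False
        if nand_via_or(bits) != 1 - direct_and:
            return False
    return True

if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, identity_holds(n))
```

Because the identity holds for any operand count, an accurate OR-type sensing operation applied to complemented inputs yields the AND-type result without a separate sensing threshold.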

A Memristor-Augmented HW/SW Framework for Taxonomic Profiling

State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead. ...

From Primitive to Complex Functions

In recent years, we are witnessing a trend moving away from conventional computer architectures towards Computation-In-Memory (CIM) based on emerging memristor devices. This is due to the fact that the performance and energy efficiency of traditional computer architectures can no longer be increased at the same pace as before. The main barriers which limit the performance and energy improvement are the memory and power walls. Thus far, the main effort from researchers is toward enabling CIM as an accelerator for specific applications. Consequently, this current application-specific nature/approach has put less emphasis on the potential general-purpose applicability of CIM, i.e., merging several accelerators into one that is less than the sum of the parts. In this paper, we demonstrate the CIM concept using a broader and generalized model. Considering this model, the state-of-the-art CIM-based logic and arithmetic primitive functions, which can be the building blocks for complex functions, are investigated. Besides, we present potential applications of CIM which provides insights into the challenges and opportunities of a generic CIM system design. Finally, we highlight the future directions regarding the construction of CIM-based systems. ...
Journal article (2022) - Amirreza Yousefzadeh, Jan Stuijt, Martijn Hijdra, Hsiao-Hsuan Liu, Anteneh Gebregiorgis, Abhairaj Singh, Said Hamdioui, Francky Catthoor
Computation-in-Memory (CIM) is an emerging computing paradigm to address memory bottleneck challenges in computer architecture. A CIM unit cannot fully replace a general-purpose processor. Still, it significantly reduces the amount of data transfer between a traditional memory unit and the processor by enriching the transferred information. Data transactions between processor and memory consist of memory access addresses and values. While the main focus in the field of in-memory computing is to apply computations on the content of the memory (values), the importance of CPU-CIM address transactions and calculations for generating the sequence of access addresses for data-dominated applications is generally overlooked. However, the amount of information transactions used for "address" can easily be even more than half of the total transferred bits in many applications. In this article, we propose a circuit to perform the in-memory Address Calculation Accelerator. Our simulation results showed that calculating address sequences inside the memory (instead of the CPU) can significantly reduce the CPU-CIM address transactions and therefore contribute to considerable savings in energy, latency, and bus traffic. For a chosen application of guided image filtering, in-memory address calculation results in almost two orders of magnitude reduction in address transactions over the memory bus. ...
Journal article (2022) - Christopher Bengel, Johannes Mohr, Stefan Wiefels, Abhairaj Singh, Anteneh Gebregiorgis, Rajendra Bishnoi, Said Hamdioui, Rainer Waser, Dirk Wouters, Stephan Menzel
Computation-in-memory using memristive devices is a promising approach to overcome the performance limitations of conventional computing architectures introduced by the von Neumann bottleneck which are also known as memory wall and power wall. It has been shown that accelerators based on memristive devices can deliver higher energy efficiencies and data throughputs when compared with conventional architectures. In the vast multitude of memristive devices, bipolar resistive switches based on the valence change mechanism (VCM) are particularly interesting due to their low power operation, non-volatility, high integration density and their CMOS compatibility. While a wide range of possible applications is considered, many of them such as artificial neural networks heavily rely on vector-matrix-multiplications (VMMs) as a mathematical operation. These VMMs are made up of large numbers of multiplication and accumulation (MAC) operations. The MAC operation can be realised using memristive devices in an analog fashion using Ohm’s law and Kirchhoff’s law. However, VCM devices exhibit a range of non-idealities, affecting the VMM performance, which in turn impacts the overall accuracy of the application. Those non-idealities can be classified into time-independent (programming variability) and time-dependent (read disturb and read noise). Additionally, peripheral circuits such as analog to digital converters can introduce errors during the digitalization. In this work, we experimentally and theoretically investigate the impact of device- and circuit-level effects on the VMM in VCM crossbars. Our analysis shows that the variability of the low resistive state plays a key role and that reading in the RESET direction should be favored over reading in the SET direction. ...
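
As a rough illustration of how Ohm's law and Kirchhoff's current law combine into a MAC operation, the Python sketch below models an idealized crossbar: each device contributes a current V_i * G_ij, and the currents sum along each column. The function names, the conductance values, and the multiplicative Gaussian variability model are assumptions made for this sketch; they are not taken from the paper and ignore wire parasitics, read disturb, and read noise.

```python
import numpy as np

def crossbar_vmm(voltages, conductances):
    """Ideal crossbar read-out: column current I_j = sum_i V_i * G_ij."""
    v = np.asarray(voltages, dtype=float)
    g = np.asarray(conductances, dtype=float)
    # Ohm's law gives the per-device current V_i * G_ij;
    # Kirchhoff's current law sums these currents on each column line.
    return v @ g

def program_with_variability(g_target, rel_sigma, rng):
    """Hypothetical programming-variability model (multiplicative Gaussian)."""
    g = np.asarray(g_target, dtype=float)
    return g * rng.normal(loc=1.0, scale=rel_sigma, size=g.shape)

# Illustrative values only: 3 rows, 2 columns, conductances in siemens.
read_voltages = np.array([0.2, 0.0, 0.2])
g_target = np.array([[100e-6, 10e-6],
                     [100e-6, 100e-6],
                     [10e-6, 100e-6]])

ideal_currents = crossbar_vmm(read_voltages, g_target)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g_actual = program_with_variability(g_target, rel_sigma=0.05, rng=rng)
    print("ideal :", ideal_currents)
    print("varied:", crossbar_vmm(read_voltages, g_actual))
```

Comparing the two printed vectors gives a feel for how device-level variability propagates directly into the analog VMM result before any A/D conversion takes place.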

Scalable and Reliable Integrate and Fire Circuit ADC for Memristor-Based CIM Architectures

Emerging computation-in-memory (CIM) paradigm offers processing and storage of data at the same physical location, thus alleviating critical memory-processor communication bottlenecks suffered by conventional von-Neumann architecture. Storage of data in a CIM architecture is analog in nature and therefore computation is performed in analog domain i.e. inputs and outputs are analog values. Since the outside computing environment is digital, analog-to-digital converters (ADC) are utilized to perform the output data conversion. However, ADC designs are bulky, power-hungry circuits that are prone to design variations and therefore, play an important role in determining the computing efficiency of CIM architectures. In this paper, we present a scalable and reliable integrate and fire circuit ADC (SRIF-ADC) design for CIM architectures, suitable for stringent power and area constraints. We devise a technique to stabilize the node receiving analog inputs that allows more rows to be activated at the same time, thereby increasing the operand size of input vectors. This allows better scalability in terms of higher parallelism of operations. We employ a self-timed variation-aware design approach and design measures to drastically reduce read disturb of memristor devices that address reliability issues related to the ADC design. In addition, we present a compact, built-in sample-and-hold circuit to replace the large-sized capacitance and built-in weighting technique to alleviate the need for post-processing. For multiply-and-accumulate (MAC) operation, our simulation results show that we can improve the computational parallelism by 3X as well as ADC conversion speed and energy efficiency are improved by 2X and 11.6X, respectively, compared to the state-of-the-art design. ...
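
For readers unfamiliar with integrate-and-fire conversion, the sketch below shows the generic principle only: the input current charges a capacitor, a spike fires each time the threshold voltage is reached, and the spike count over a fixed window serves as the digital code. This is a textbook-style idealization with hypothetical parameter names and values; it is not the SRIF-ADC circuit and omits its node stabilization, self-timing, and built-in weighting.

```python
import math

def integrate_and_fire_count(current_a, window_s, cap_f, v_threshold):
    """Spike count of an ideal integrate-and-fire converter.

    Each spike consumes a charge of cap_f * v_threshold, so the count is
    the total integrated charge divided by the charge per spike, rounded down.
    """
    if current_a < 0 or window_s < 0:
        raise ValueError("current and window must be non-negative")
    charge_per_spike = cap_f * v_threshold
    total_charge = current_a * window_s
    return math.floor(total_charge / charge_per_spike)

# Illustrative (assumed) values: 100 fF capacitor, 0.5 V threshold, 1 us window.
CAP_F = 100e-15
V_TH = 0.5
WINDOW_S = 1e-6

if __name__ == "__main__":
    for current in (5.03e-6, 10.12e-6):
        count = integrate_and_fire_count(current, WINDOW_S, CAP_F, V_TH)
        print(f"{current:.2e} A -> {count} spikes")
```

In this idealized model the count grows linearly with the column current, which is why activating more rows (a larger summed current) needs either a longer window or a wider counter.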
Conference paper (2021) - Abhairaj Singh, Sumit Diware, Anteneh Gebregiorgis, Rajendra Bishnoi, Francky Catthoor, Rajiv V. Joshi, Said Hamdioui
With the rise of the Internet of Things (IoT), a huge market for so-called smart edge-devices is foreseen for millions of applications, like personalized healthcare and smart robotics. These devices have to bring smart computing directly where the data is generated, while coping with the limited energy budget. Conventional von-Neumann architectures fail to meet these requirements due to, e.g., the memory-processor data transfer bottleneck. Memristor-based computation-in-memory (CIM) has the potential to realize smart local computing for highly parallel data-dominated AI applications by exploiting the inherent properties of the architecture and the physical characteristics of the memristors. This paper provides a broad overview of CIM architecture highlighting its potential and unique properties in enabling smart local computing. Moreover, it discusses design considerations of such architectures including both crossbar array as well as peripheral circuits; special attention is given to the analog-to-digital converter (ADC), as it is the most critical unit of analog-based CIM operation, e.g., vector-matrix multiplication (VMM). Finally, the paper outlines the potential future directions for CIM-based edge smart computing. ...