A. Singh
Recent advances in Resistive RAM (RRAM) based Computation-In-Memory (CIM) architectures highlight significant potential for accelerating data-intensive computing tasks. However, non-idealities in RRAM devices, such as variability, result in small sensing margins that can significantly affect the computational efficiency. This issue becomes even more pronounced when dealing with complex multi-operand logic operations. This paper introduces a circuit-level scheme for CIM-based multi-operand XOR logic operations, leveraging a Voltage-To-Time converter (VTC) to perform multi-phased XORs in a single clock cycle. In this approach, we exploit bitline capacitances for voltage-based sensing during computation, generating an output voltage that is linearly proportional to the operand values. This voltage is then converted into the desired logic output using the VTC design. Furthermore, low-power techniques are employed in the deployment of sense amplifiers, such as regulating power consumption during operation and disabling the amplifiers once the decision is made. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that up to 16-operand XOR operation can be accurately and reliably performed as opposed to a maximum of three operands supported by state-of-the-art solutions, while offering up to 49× better figure-of-merit combining energy-efficiency and throughput.
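The sensing idea in the abstract above can be illustrated with an idealized behavioral model. This is only a sketch under stated assumptions, not the authors' circuit: the bitline voltage is assumed to rise by a fixed, hypothetical `v_step` per operand storing a 1, and the Voltage-To-Time conversion is abstracted as a simple quantization of that voltage. It shows why a voltage that is linear in the operand values is enough to recover a multi-operand XOR, which is the parity of the number of ones.

```python
from functools import reduce
from operator import xor


def bitline_voltage(bits, v_step=0.05):
    """Idealized bitline voltage: linear in the number of operands equal to 1.

    v_step is a hypothetical per-operand voltage increment (assumption).
    """
    return v_step * sum(bits)


def multi_operand_xor(bits, v_step=0.05):
    """Recover the XOR of all operands from the idealized bitline voltage.

    The quantization below stands in for the voltage-to-time conversion;
    the real design is a circuit, not an arithmetic rounding step.
    """
    ones = round(bitline_voltage(bits, v_step) / v_step)
    return ones % 2  # XOR of many bits is the parity of the number of ones


operands = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # 16 operands
assert multi_operand_xor(operands) == reduce(xor, operands)
```

In this model, device variability would appear as noise on the voltage, and the number of operands that can be resolved is limited by how small `v_step` becomes relative to that noise.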
Logic and arithmetic computation-in-memory accelerators
Based on memristor devices
Memristor-based CIM has been extensively explored for data-intensive computational tasks in applications related to artificial intelligence (AI), Big Data, and data encryption, with the aim of realizing high-performance, low-power micro-architectural solutions. This thesis first identifies the key challenges related to emerging memory technologies, the CIM memory array, and the circuit design of the periphery logic needed to achieve low-power, high-performance CIM units. Thereafter, it presents several micro-architectural solutions to build CIM accelerators that perform logic and arithmetic operations efficiently.
Identifying the challenges of memristor-based CIM: The thesis briefly summarizes the key aspects of a memristor-based CIM architecture that can perform certain logic and arithmetic operations. First, the key components of a typical CIM architecture are presented, highlighting the design considerations for building these components. The contribution of each component to computational accuracy and efficiency is then assessed, supported by a comprehensive literature survey that highlights the underlying concepts of each component. This is followed by identifying the key challenges in achieving the stringent efficiency metrics required by the targeted applications while dealing with non-idealities. Finally, the limitations of the state-of-the-art solutions that target these challenges are highlighted, thus providing the motivation to develop the improved solutions presented in the thesis.
Developing micro-architectural solutions for efficient CIM-based logic accelerators: This part of the thesis presents several logic accelerators that can perform (N)OR, (N)AND, and XOR operations in a fast and energy-efficient manner. Five different solutions target different aspects of performing CIM-based logic: scaling the number of operands per cycle and the memory size, and expanding the types of operations supported, all while addressing the non-idealities. The proposed solutions comprehensively outperform the state of the art in terms of energy efficiency, performance, and the maximum number of operands that can be accurately processed in a single cycle.
Developing micro-architectural solutions for efficient CIM-based arithmetic accelerators: This part of the thesis presents several compact arithmetic accelerators that can perform multiply-and-accumulate (MAC) operations with low energy consumption. Three different analog-to-digital converter (ADC) topologies are presented, two of which are demonstrated with a chip prototype. Optimizing the memory structure is also explored to ensure accurate generation of the analog output in the CIM memory array and to optimize the A/D conversion. A comparison of the proposed solutions with the state of the art is presented, highlighting the promise of the developed micro-architectures.
Analog computation-in-memory (CIM) architectures alleviate massive data movement between the memory and the processor, and thus promise to accelerate certain computational tasks in an energy-efficient manner. However, the data converters involved in these architectures typically achieve the required computing accuracy at the expense of a high area and energy footprint, which can determine the suitability of CIM for low-power and compact edge-AI devices. In this work, we present a memory-periphery co-design to perform accurate A/D conversions of analog matrix-vector-multiplication (MVM) outputs. Here, we introduce a scheme where select-lines and bit-lines in the memory are virtually fixed to improve conversion accuracy and aid a ring-oscillator-based A/D conversion, equipped with component sharing and inter-matching of the reference blocks. In addition, we deploy a self-timed technique to further ensure high robustness against global design and cycle-to-cycle variations. Based on measurement results of a 4Kb CIM chip prototype implemented in TSMC 40nm technology, a relative accuracy of up to 99.71% is achieved with an energy efficiency of 115.1 TOPS/W and a computational density of 12.1 TOPS/mm² for the MNIST dataset. This corresponds to improvements of up to 11.3× and 7.5×, respectively, compared to the state of the art.
The computation-in-memory (CIM) paradigm leverages emerging memory technologies such as resistive random access memories (RRAMs) to process data within the memory itself. This alleviates the memory-processor bottleneck, resulting in much higher hardware efficiency than conventional hardware based on the von Neumann architecture. Hence, CIM becomes an attractive alternative for applications like neural networks, which require a huge number of data transfer operations in conventional hardware. CIM-based neural networks typically employ a bit-slicing scheme, which represents a single neural weight using multiple RRAM devices (called slices) to meet the high bit-precision demand. However, such neural networks suffer from significant accuracy degradation due to the non-zero Gmin error, where a zero weight in the neural network is represented by an RRAM device with a non-zero conductance. This paper proposes an unbalanced bit-slicing scheme to mitigate the impact of the non-zero Gmin error. It achieves this by allocating appropriate sensing margins to different slices based on their binary positions. It also tunes the sensing margins to meet the demands of either high accuracy or energy efficiency. The sensing margin allocation is supported by 2's complement arithmetic, which further reduces the influence of the non-zero Gmin error. Simulation results show that our proposed scheme achieves up to 7.3× the accuracy and up to 7.8× the correct operations per unit energy consumption of the state of the art.
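The non-zero Gmin error described above can be made concrete with a small numerical sketch. It models conventional (balanced) bit-slicing only, not the unbalanced scheme the paper proposes. All names and parameter values (`g_min`, `g_step`, two bits per slice, four slices) are assumptions chosen for illustration: each slice level is stored as a conductance `g_min + level * g_step`, so a zero level still contributes `g_min` to the column current.

```python
def slice_weight(w, bits_per_slice, n_slices):
    """Split an unsigned integer weight into slices, least significant first."""
    mask = (1 << bits_per_slice) - 1
    return [(w >> (s * bits_per_slice)) & mask for s in range(n_slices)]


def sliced_mac(inputs, weights, bits_per_slice=2, n_slices=4,
               g_min=0.0, g_step=1.0):
    """Dot product computed slice by slice, as in a bit-sliced crossbar.

    Each slice column sums x_i * (g_min + level_i * g_step); the per-slice
    results are then shifted by the slice's binary position and added.
    """
    sliced = [slice_weight(w, bits_per_slice, n_slices) for w in weights]
    total = 0.0
    for s in range(n_slices):
        current = sum(x * (g_min + levels[s] * g_step)
                      for x, levels in zip(inputs, sliced))
        total += (current / g_step) * (1 << (s * bits_per_slice))
    return total


inputs = [1, 0, 1]
weights = [200, 13, 7]
ideal = sliced_mac(inputs, weights, g_min=0.0)   # exact dot product
real = sliced_mac(inputs, weights, g_min=0.1)    # with non-zero Gmin
gmin_error = real - ideal
```

In this toy model the error grows with the number of active inputs and is weighted most heavily by the most significant slice, which is consistent with the abstract's motivation for treating slices differently according to their binary position.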
Smart computing on edge devices has demonstrated huge potential for various application sectors such as personalized healthcare and smart robotics. These devices aim to bring smart computing close to the source where the data is generated or stored, while coping with the stringent resource budget of edge platforms. The conventional von Neumann architecture fails to meet these requirements due to various limitations, e.g., the memory-processor data transfer bottleneck. Memristor-based Computation-In-Memory (CIM) has the potential to realize such smart edge computing for data-dominated Artificial Intelligence (AI) applications by exploiting both the inherent properties of the architecture and the physical characteristics of the memristors. This paper discusses different aspects of CIM, including its classification, working principle, potential, and design flow. The design flow is illustrated through two case studies that demonstrate the potential of CIM to realize orders-of-magnitude improvement in energy efficiency compared to conventional architectures. Finally, future challenges and research directions of CIM are covered.
System Design for Computation-in-Memory
From Primitive to Complex Functions
Resistive random access memory (RRAM) based computation-in-memory (CIM) architectures are attracting a lot of attention due to their potential for fast and energy-efficient computing. However, RRAM variability and non-idealities limit the computing accuracy of such architectures, especially for multi-operand logic operations. This paper proposes a voltage-based differential referencing-in-array scheme that enables accurate two- and multi-operand logic operations for RRAM-based CIM architectures. The scheme makes use of a 2T2R cell configuration to create a complementary bitcell structure that inherently also acts as a reference during operation execution; this results in a high sensing margin. Moreover, the variation-sensitive multi-operand (N)AND operation is implemented using a complementary-input (N)OR operation to further improve its accuracy. Simulation results for a post-layout extracted 512x512 (256Kb) RRAM-based CIM array show that (N)OR/(N)AND operations with up to 56 operands can be performed accurately and reliably, as opposed to a maximum of 4 operands supported by state-of-the-art solutions, while offering up to 11.4X better energy efficiency.
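The complementary-input trick mentioned above rests on De Morgan's law: the AND of a set of operands equals the NOR of their complements, and the NAND equals the OR of their complements. The following is a purely logical sketch of that identity; the function names are hypothetical, and taking the complement in software stands in for reading the complementary half of a 2T2R bitcell.

```python
from itertools import product


def nor(bits):
    """Multi-operand NOR: 1 only if every operand is 0."""
    return int(not any(bits))


def and_via_complement_nor(bits):
    """AND computed as NOR over the complemented operands (De Morgan)."""
    return nor([1 - b for b in bits])


def nand_via_complement_or(bits):
    """NAND computed as OR over the complemented operands (De Morgan)."""
    return int(any(1 - b for b in bits))


# Exhaustive check of both identities for all 4-operand inputs.
for bits in product([0, 1], repeat=4):
    assert and_via_complement_nor(bits) == int(all(bits))
    assert nand_via_complement_or(bits) == int(not all(bits))
```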
Computation-in-memory using memristive devices is a promising approach to overcome the performance limitations of conventional computing architectures introduced by the von Neumann bottleneck, which are also known as the memory wall and the power wall. It has been shown that accelerators based on memristive devices can deliver higher energy efficiencies and data throughputs than conventional architectures. Among the vast multitude of memristive devices, bipolar resistive switches based on the valence change mechanism (VCM) are particularly interesting due to their low-power operation, non-volatility, high integration density, and CMOS compatibility. While a wide range of possible applications is considered, many of them, such as artificial neural networks, rely heavily on vector-matrix multiplications (VMMs) as a mathematical operation. These VMMs are made up of large numbers of multiplication and accumulation (MAC) operations. The MAC operation can be realised using memristive devices in an analog fashion using Ohm's law and Kirchhoff's law. However, VCM devices exhibit a range of non-idealities that affect the VMM performance, which in turn impacts the overall accuracy of the application. Those non-idealities can be classified into time-independent (programming variability) and time-dependent (read disturb and read noise). Additionally, peripheral circuits such as analog-to-digital converters can introduce errors during digitization. In this work, we experimentally and theoretically investigate the impact of device- and circuit-level effects on the VMM in VCM crossbars. Our analysis shows that the variability of the low resistive state plays a key role and that reading in the RESET direction should be favored over reading in the SET direction.
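The analog MAC mentioned above follows directly from the two circuit laws: Ohm's law gives the current through each device as voltage times conductance, and Kirchhoff's current law sums the device currents on each column. The following is a minimal idealized sketch with hypothetical example values; it ignores wire resistance and every device non-ideality the paper studies.

```python
def crossbar_vmm(voltages, conductances):
    """Ideal crossbar vector-matrix multiplication.

    voltages[i] is the voltage applied to row i (volts).
    conductances[i][j] is the device conductance at row i, column j (siemens).
    Returns the current collected on each column (amperes).
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))  # KCL sum of Ohm's-law currents
        for j in range(n_cols)
    ]


# Hypothetical 2x2 example: two row voltages, four device conductances.
v_in = [0.2, 0.1]
g = [[1e-6, 2e-6],
     [3e-6, 4e-6]]
i_out = crossbar_vmm(v_in, g)
```

Programming variability could be explored in this model by perturbing the entries of `g` before the multiplication.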
Computation-in-Memory (CIM) is an emerging computing paradigm to address memory bottleneck challenges in computer architecture. A CIM unit cannot fully replace a general-purpose processor. Still, it significantly reduces the amount of data transfer between a traditional memory unit and the processor by enriching the transferred information. Data transactions between processor and memory consist of memory access addresses and values. While the main focus in the field of in-memory computing is on applying computations to the content of the memory (values), the importance of CPU-CIM address transactions and of the calculations that generate the sequence of access addresses for data-dominated applications is generally overlooked. However, the information transferred for addresses can easily exceed half of the total transferred bits in many applications. In this article, we propose a circuit that implements an in-memory Address Calculation Accelerator. Our simulation results show that calculating address sequences inside the memory (instead of in the CPU) can significantly reduce the CPU-CIM address transactions and therefore contribute to considerable savings in energy, latency, and bus traffic. For a chosen application of guided image filtering, in-memory address calculation results in almost two orders of magnitude reduction in address transactions over the memory bus.
Spin-transfer torque magnetic random access memory (STT-MRAM) based computation-in-memory (CIM) architectures have shown great prospects for energy-efficient computing. However, device variations and non-idealities narrow the sensing margin, which severely impacts the computing accuracy. In this work, we propose an adaptive referencing mechanism to improve the sensing margin of a CIM architecture for Boolean binary logic (BBL) operations. We generate reference signals using multiple STT-MRAM devices and place them strategically in the array such that these signals can address the variations and trace the wire parasitics effectively. We have demonstrated this behavior using an STT-MRAM model, which is calibrated using a characterized 1Mbit array. Results show that our proposed architecture achieves up to 17.8 TOPS/W for binary neural networks (BNN) on the MNIST dataset and a 130× performance improvement for text encryption compared to a software implementation on an Intel Haswell processor.
KrakenOnMem
A Memristor-Augmented HW/SW Framework for Taxonomic Profiling
State-of-the-art taxonomic profilers, which form the first step in larger-context metagenomic studies, have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. The Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and the required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to the original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead.
We present a 256 × 256 in-memory compute (IMC) core designed and fabricated in 14-nm CMOS technology with backend-integrated multi-level phase change memory (PCM). It comprises 256 linearized current-controlled oscillator (CCO)-based A/D converters (ADCs) at a compact 4-μm pitch and a local digital processing unit (LDPU) performing affine scaling and ReLU operations. A frequency-linearization technique for the CCO is introduced, which increases the maximum CCO frequency beyond 3 GHz while ensuring accurate on-chip matrix-vector multiplications (MVMs). Moreover, the design and functionality of the digital ADC calibration procedure are described in detail, and the MVM accuracy is quantified. Finally, the measured classification accuracies of deep learning (DL) inference applications on the MNIST and CIFAR-10 datasets, when two IMC cores are employed, are presented. For a performance density of 1.59 TOPS/mm², a measured energy efficiency of 10.5 TOPS/W is achieved at a main clock frequency of 1 GHz.
SRIF
Scalable and Reliable Integrate and Fire Circuit ADC for Memristor-Based CIM Architectures
The emerging computation-in-memory (CIM) paradigm offers processing and storage of data at the same physical location, thus alleviating the critical memory-processor communication bottlenecks suffered by the conventional von Neumann architecture. Storage of data in a CIM architecture is analog in nature, and computation is therefore performed in the analog domain, i.e., inputs and outputs are analog values. Since the outside computing environment is digital, analog-to-digital converters (ADCs) are used to perform the output data conversion. However, ADCs are bulky, power-hungry circuits that are prone to design variations and therefore play an important role in determining the computing efficiency of CIM architectures. In this paper, we present a scalable and reliable integrate and fire circuit ADC (SRIF-ADC) design for CIM architectures, suitable for stringent power and area constraints. We devise a technique to stabilize the node receiving analog inputs, which allows more rows to be activated at the same time, thereby increasing the operand size of input vectors. This allows better scalability in terms of higher parallelism of operations. To address reliability issues related to the ADC design, we employ a self-timed, variation-aware design approach together with design measures that drastically reduce read disturb of the memristor devices. In addition, we present a compact, built-in sample-and-hold circuit to replace the large capacitance, and a built-in weighting technique to alleviate the need for post-processing. For multiply-and-accumulate (MAC) operations, our simulation results show that computational parallelism improves by 3X, while ADC conversion speed and energy efficiency improve by 2X and 11.6X, respectively, compared to the state-of-the-art design.
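The basic integrate-and-fire conversion principle referred to above can be sketched as a simple behavioral model. This is a generic textbook-style model under stated assumptions, not the SRIF-ADC circuit: the input current charges a capacitor, a spike is counted and the threshold voltage is subtracted each time the capacitor voltage reaches the threshold, and the spike count within a fixed window is the digital output. All parameter names and values are hypothetical.

```python
def integrate_and_fire_adc(current, capacitance=1e-12, v_threshold=0.5,
                           window=1e-6, dt=1e-9):
    """Count threshold crossings of an integrating capacitor.

    current      input current in amperes (assumed constant over the window)
    capacitance  integration capacitor in farads (hypothetical value)
    v_threshold  firing threshold in volts (hypothetical value)
    window       conversion time in seconds
    dt           simulation time step in seconds
    """
    v = 0.0
    spikes = 0
    for _ in range(int(round(window / dt))):
        v += current * dt / capacitance   # dV = I * dt / C
        if v >= v_threshold:
            spikes += 1
            v -= v_threshold              # fire and keep the residual charge
    return spikes


# The count approximates floor(I * window / (C * V_threshold)).
low_code = integrate_and_fire_adc(2.3e-6)
high_code = integrate_and_fire_adc(4.7e-6)
```

A larger input current produces a proportionally larger spike count, which is what makes the count usable as a digital representation of an analog MAC result.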
With the rise of the Internet of Things (IoT), a huge market for so-called smart edge devices is foreseen for millions of applications, like personalized healthcare and smart robotics. These devices have to bring smart computing directly to where the data is generated, while coping with a limited energy budget. Conventional von Neumann architectures fail to meet these requirements due to, e.g., the memory-processor data transfer bottleneck. Memristor-based computation-in-memory (CIM) has the potential to realize smart local computing for highly parallel, data-dominated AI applications by exploiting the inherent properties of the architecture and the physical characteristics of the memristors. This paper provides a broad overview of the CIM architecture, highlighting its potential and unique properties in enabling smart local computing. Moreover, it discusses design considerations of such architectures, including both the crossbar array and the peripheral circuits; special attention is given to the analog-to-digital converter (ADC), as it is the most critical unit of analog-based CIM operations, e.g., vector-matrix multiplication (VMM). Finally, the paper outlines potential future directions for CIM-based smart edge computing.