J. Yu | TU Delft Repository

A Survey on Memory-centric Computer Architectures

Journal article (2022) - A.B. Gebregiorgis, H.A. Du Nguyen, J. Yu, R.K. Bishnoi, M. Taouil, Catthoor Franky, S. Hamdioui

Faster and cheaper computers have been constantly demanding technological and architectural improvements. However, current technology is suffering from three technology walls: leakage wall, reliability wall, and cost wall. Meanwhile, existing architecture performance is also saturating due to three well-known architecture walls: memory wall, power wall, and instruction-level parallelism (ILP) wall. Hence, a lot of novel technologies and architectures have been introduced and developed intensively. Our previous work has presented a comprehensive classification and broad overview of memory-centric computer architectures. In this article, we aim to discuss the most important classes of memory-centric architectures thoroughly and evaluate their advantages and disadvantages. Moreover, for each class, the article provides a comprehensive survey on memory-centric architectures available in the literature. ...

Computation-in-Memory

From Circuits to Compilers

Doctoral thesis (2021) - J. Yu

Memristive devices are promising candidates as a complement to CMOS devices. These devices come with several advantages such as non-volatility, high density, good scalability, and CMOS compatibility. They enable in-memory computing paradigms since they can be used for both storing and computing. However, building in-memory computing systems using memristive devices is still in an early research stage. Therefore, challenges still exist with respect to the development of devices, circuits, architectures, design automation, and applications. This thesis focuses on developing memristive device-based circuits, their usage in in-memory computing architectures, and design automation methodologies to create or use such circuits. Circuit Level – We propose two logical operation schemes based on memristive devices. The first one uses resistive sensing to perform logical operations. It modifies the sense amplifier in such a way that it can compare the overall current with references and output the logical operation result. During sensing, the resistance of memristive devices remains unchanged. Therefore, endurance and lifetime are not reduced. This scheme provides a solution for maintaining a relatively long lifetime in logic operations for memristive devices that have low endurance. The second scheme is the enhanced version of the first one. It uses two different sensing paths for AND and OR operations. In this way, the correctness of logic operations can be guaranteed even if large resistance variation exists in memristive devices. Architecture Level – We present three in-memory computing architectures based on memristive devices. The first one is a heterogeneous architecture containing an accelerator for vector bit-wise logical operations and a CPU. The accelerator communicates with the CPU or accesses the external memory directly. The second one is to accelerate automata processing. In this architecture, memristive memory arrays store configuration information and conduct computation as well. This architecture outperforms similar ones that are built with conventional memory technologies. The third one is an improved version of the second one. It breaks the routing network into multiple pipeline stages, each processing a different input sequence. In this way, the architecture achieves a higher throughput with a negligible area overhead. Design Automation – A synthesis flow for computation-in-memory architectures and a compiler for automata processors are presented. The synthesis flow is proposed based on the concept of skeletons, which relates an algorithmic structure to a pre-defined solution template. This solution template contains scheduling, placement, and routing information needed for the hardware generation. After the user rewrites the algorithm using skeletons, the tool generates the desired circuit by instantiating the solution template. The automata processor compiler generates configuration bits according to the input automata. It uses multiple strategies to transform given automata, so that constraint conflicts can be resolved automatically. It also optimizes the mapping for storage utilization. ...

Memristive devices are promising candidates as a complement to CMOS devices. These devices come with several advantages such as non-volatility, high density, good scalability, and CMOS compatibility. They enable in-memory computing paradigms since they can be used for both storing and computing. However, building in-memory computing systems using memristive devices is still in an early research stage. Therefore, challenges still exist with respect to the development of devices, circuits, architectures, design automation, and applications. This thesis focuses on developing memristive device-based circuits, their usage in in-memory computing architectures, and design automation methodologies to create or use such circuits. Circuit Level – We propose two logical operation schemes based on memristive devices. The first one uses resistive sensing to perform logical operations. It modifies the sense amplifier in such a way that it can compare the overall current with references and output the logical operation result. During sensing, the resistance of memristive devices remains unchanged. Therefore, endurance and lifetime are not reduced. This scheme provides a solution for maintaining a relatively long lifetime in logic operations for memristive devices that have low endurance. The second scheme is the enhanced version of the first one. It uses two different sensing paths for AND and OR operations. In this way, the correctness of logic operations can be guaranteed even if large resistance variation exists in memristive devices. Architecture Level – We present three in-memory computing architectures based on memristive devices. The first one is a heterogeneous architecture containing an accelerator for vector bit-wise logical operations and a CPU. The accelerator communicates with the CPU or accesses the external memory directly. The second one is to accelerate automata processing. In this architecture, memristive memory arrays store configuration information and conduct computation as well. This architecture outperforms similar ones that are built with conventional memory technologies. The third one is an improved version of the second one. It breaks the routing network into multiple pipeline stages, each processing a different input sequence. In this way, the architecture achieves a higher throughput with a negligible area overhead. Design Automation – A synthesis flow for computation-in-memory architectures and a compiler for automata processors are presented. The synthesis flow is proposed based on the concept of skeletons, which relates an algorithmic structure to a pre-defined solution template. This solution template contains scheduling, placement, and routing information needed for the hardware generation. After the user rewrites the algorithm using skeletons, the tool generates the desired circuit by instantiating the solution template. The automata processor compiler generates configuration bits according to the input automata. It uses multiple strategies to transform given automata, so that constraint conflicts can be resolved automatically. It also optimizes the mapping for storage utilization.

APmap

An Open-Source Compiler for Automata Processors

Journal article (2021) - Jintao Yu, Muath Abu Lebdeh, Hoang Anh Du Nguyen, Mottaqiallah Taouil, Said Hamdioui

A novel type of hardware accelerators called automata processors (APs) have been proposed to accelerate finite-state automata. The bone structure of an AP is a hierarchical routing matrix that connects many memory arrays. With this structure, an AP can process an input symbol every clock cycle, and hence achieve much higher performance compared to conventional architectures. However, the design automation for the APs is not well researched. This article proposes a fully automated tool named APmap for mapping the automata to APs that use a two-level routing matrix. APmap first partitions a large automaton into small graphs and then maps them. Multiple transformations are applied to the automaton by APmap to meet hardware constraints. The experiments on a standard benchmark suite show that our approach leads to around 19% less storage utilization compared to state-of-the-art. ...

Skeleton-based Synthesis Flow for Computation-In-Memory Architectures

Journal article (2020) - Jintao Yu, Razvan Nane, Imran Ashraf, Mottaqiallah Taouil, Said Hamdioui, Henk Corporaal, Koen Bertels

Memristor-based Computation-in-Memory (CIM) is one of the emerging architectures for next-generation Big Data problems. Its design requires a radically new synthesis flow as the memristor is a passive device that uses resistances to encode its logic values. This article proposes a synthesis flow for mapping parallel applications on memristor-based CIM architecture. First, it employs solution templates that contain scheduling, placement, and routing information to map multiple algorithms with similar data flow graphs to the memristor crossbar; this template is named skeleton. Complex algorithms that do not fit a single skeleton can be solved by nested skeletons. Therefore, this approach can be applied to a wide range of applications while using a limited number of skeletons only. Second, it further improves the design when spatial and temporal patterns exist in input data. To accelerate simulation of generated SystemC models, we integrate MPI in skeletons. The synthesis flow and its additional features are verified with multiple applications, and the results are compared against a multicore platform. These experiments demonstrate the feasibility and the potential of this approach. ...

The Power of Computation-in-Memory Based on Memristive Devices

Conference paper (2020) - Jintao Yu, Muath Abu Lebdeh, Hoang Anh Du Nguyen, Mottaqiallah Taouil, Said Hamdioui

Conventional computing architectures and the CMOS technology that they are based on are facing major challenges such as the memory bottleneck making the memory access for data transfer a major killer of energy and performance. Computation-in-memory (CIM) paradigm is seen as a potential alternative that could alleviate such problems by adding computational resources to the memory, and significantly reducing the communication. Memristive devices are promising enablers of a such CIM paradigm, as they are able to support both storage and computing. This paper shows the power of memristive device based CIM paradigm in enabling new efficient application-specific architectures as well as efficient implementations of some known domain-specific architectures. In addition, the paper discusses the potential applications that could benefit from such paradigm and highlights the major challenges. ...

Enhanced Scouting Logic

A Robust Memristive Logic Design Scheme

Conference paper (2020) - Jintao Yu, Hoang Anh Du Nguyen, Muath Abu Lebdeh, Mottaqiallah Taouil, Said Hamdioui

Memristive devices have the potential to reduce the memory access bottleneck in conventional computer architectures. However, memristive devices also suffer from low endurance and large resistance variation. To address these problems, we present a robust logic scheme named Enhanced Scouting Logic (ESL). It produces logic operation results within the peripheral circuit of the memory array. During the execution of logic operations, the resistance states of memristive devices do not change and hence do not affect the memristor lifetime. ESL senses the resistance of input memristive devices via two different paths when different operations such as AND and OR are performed. These different paths guarantee the operation correctness even under large resistance variations. We verified ESL using SPICE simulations and Monte Carlo analysis. Our simulation results show that ESL is more robust as compared with state-of-the-art logic schemes. ...

A Classification of Memory-Centric Computing

Journal article (2020) - H.A. Du Nguyen, J. Yu, M.F.M. Abu Lebdeh, M. Taouil, S. Hamdioui, Francky Catthoor

Technological and architectural improvements have been constantly required to sustain the demand of faster and cheaper computers. However, CMOS down-scaling is suffering from three technology walls: leakage wall, reliability wall, and cost wall. On top of that, a performance increase due to architectural improvements is also
gradually saturating due to three well-known architecture walls: memory wall, power wall, and instruction level parallelism (ILP) wall. Hence, a lot of research is focusing on proposing and developing new technologies and architectures. In this article, we present a comprehensive classification of memory-centric computing architectures; it is based on three metrics: computation location, level of parallelism, and used memory technology. The classification not only provides an overview of existing architectures with their pros and cons but also unifies the terminology that uniquely identifies these architectures and highlights the potential future architectures that can be further explored. Hence, it sets up a direction for future research in the field. ...

CIM-SIM

Computation in Memory SIMuIator

Conference paper (2019) - Ali Banagozar, Kanishkan Vadivel, Sander Stuijk, Henk Corporaal, Stephan Wong, Muath Abu Lebdeh, Jintao Yu, Said Hamdioui

Computation-in-memory reverses the trend in von-Neumann processors by bringing the computation closer to the data, to even within the memory array, as opposed to introducing new memory hierarchies to keep (frequently used) data closer to a central processing unit (CPU). In recent years, new non-volatile memory (NVM) technologies, e.g., memristor, PCM, etc., have proven that they can function as memories and perform computations on the stored data as well. In particular, when they are combined with a modest set of (digital) peripheral modules, a wider range of operations can be supported, e.g., vector matrix multiply and Boolean logic. In this paper, we are introducing the CIM-SIM, an open source simulator written in SystemC, which is capable of simulating the functional behaviour of such architectures. The architecture includes the definition of a set of technology-agnostic nano-instructions. ...

Time-division Multiplexing Automata Processor

Conference paper (2019) - Jintao Yu, Hoang Anh Du Nguyen, Muath Abu Lebdeh, Mottaqiallah Taouil, Said Hamdioui

Automata Processor (AP) is a special implementation of non-deterministic finite automata that performs pattern matching by exploring parallel state transitions. The implementation typically contains a hierarchical switching network, causing long latency. This paper proposes a methodology to split such a hierarchical switching network into multiple pipelined stages, making it possible to process several input sequences in parallel by using time-division multiplexing. We use a new resistive RAM based AP (instead of known DRAM or SRAM based) to illustrate the potential of our method. The experimental results show that our approach increases the throughput by almost a factor of 2 at a cost of marginal area overhead. ...

A Computation-In-Memory Accelerator Based on Resistive Devices

Conference paper (2019) - Hoang Anh Du Nguyen, Jintao Yu, Muath Abu Lebdeh, Mottaqiallah Taouil, Said Hamdioui

Today's computing architectures suffer from the three well-known bottlenecks, which are the memory, the power and the instruction-level parallelism walls. Emerging non-volatile technologies, such as memristor, enable new resistive architectures that alleviate at least two of such bottlenecks, as they can process data within the memory with almost no leakage. In this paper, we propose a novel resistive computing architecture by extending a conventional architecture with a resistive based Computation-In-Memory accelerator (CIMX). We evaluate the delay, energy and area of the conventional and CIMX architecture using an analytical model and a simulation framework. The results (both based on the analytical model and simulation framework) show that the proposed architecture achieves at least one order of magnitude improvement in terms of performance, area, and energy efficiency for the considered benchmarks. ...

Memristive devices for computation-in-memory

Conference paper (2018) - Jintao Yu, Hoang Anh Du Nguyen, Lei Xie, Mottaqiallah Taouil, Said Hamdioui

CMOS technology and its continuous scaling have made electronics and computers accessible and affordable for almost everyone on the globe; in addition, they have enabled the solutions of a wide range of societal problems and applications. Today, however, both the technology and the computer architectures are facing severe challenges/walls making them incapable of providing the demanded computing power with tight constraints. This motivates the need for the exploration of novel architectures based on new device technologies; not only to sustain the financial benefit of technology scaling, but also to develop solutions for extremely demanding emerging applications. This paper presents two computation-in-memory based accelerators making use of emerging memristive devices; they are Memristive Vector Processor and RRAM Automata Processor. The preliminary results of these two accelerators show significant improvement in terms of latency, energy and area as compared to today's architectures and design. ...

Interconnect Networks for Resistive Computing Architectures

Conference paper (2017) - Hoang Anh Du Nguyen, Lei Xie, Jintao Yu, Mottaqiallah Taouil, Said Hamdioui

Today's computing systems suffer from a memory/communication bottleneck, resulting in high energy consumption and saturated performance. This makes them inefficient in solving data-intensive applications at reasonable cost. Computation-In-Memory (CIM) architecture, based on the integration of storage and computation in the same physical location using non-volatile memristor crossbar technology, offers a potential solution to the memory bottleneck. An efficient interconnect network is essential to maximize CIM's architectural performance. This paper presents three interconnect network schemes for CIM architecture; these are (1) CMOS-based, (2) memristor-based and (3) hybrid cmos/memristor interconnect network scheme. To illustrate the feasibility of such schemes, a CIM parallel adder is used as a case study. The results show that the hybrid interconnect network scheme achieves a higher performance in comparison with the CMOS-based and memristor-based interconnect scheme in terms of delay, energy and area. ...

A Domain-Specific Language and Compiler for Computation-in-Memory Skeletons

Conference paper (2017) - Jintao Yu, Tom Hogervorst, Razvan Nane

Computation-in-Memory (CiM) is a new computer architecture template based on the in-memory computing paradigm. CiM can solve the memory-wall problem of classical Von Neumann-based computer systems by exploiting application-specific computational and data-flow patterns with the capability of performing both storage and computations of emerging resistive RAM technologies (e.g., memristors). However, to efficiently explore and design such radically new application-specific CiM architectures, we require fundamentally new algorithm specification and compilation techniques. In this paper, we introduce a domain-specific language to express not only the computational patterns of an algorithm but also its spatial characteristics. Furthermore, we design a compiler that is able to transform these patterns into highly-optimized CiM designs. Experiments demonstrate the functional correctness of the language and the compiler as well as an order of magnitude speedup improvement over a multicore system in both performance and energy costs. ...

Scouting Logic

A Novel Memristor-Based Logic Design for Resistive Computing

Conference paper (2017) - Lei Xie, Hoang Anh Du Nguyen, Jintao Yu, Ali Kaichouhi, Mottaqiallah Taouil, Mohammad AlFailakawi, Said Hamdioui

Memristor technology is a promising alternative to CMOS due to its high integration density, near-zero standby power, and ability to implement novel resistive computing. One of the major limitations of these architectures is the limited endurance of memristor devices, especially when a logic gate requires multiple steps/switching to execute the logic operations. To alleviate the endurance requirement and improve the performance, we present a novel logic design style, called scouting logic that executes any logic gate by only reading the memristor devices and without changing their states. Hence, no impact on the memristors' endurance. The proposed design is implemented using two styles (current and voltage based). To illustrate the performance of scouting logic based designs, the area, delay, and power consumption are analyzed and compared with state-ofthe- art. The results show that scouting logic improves the delay and power consumption by at least a factor of 2.3, while having similar or less area overhead. Finally, we discuss the potential applications and challenges of scouting logic. ...

On the Robustness of Memristor Based Logic Gates

Conference paper (2017) - Lei Xie, Hoang Anh Du Nguyen, Jintao Yu, Mottaqiallah Taouil, Said Hamdioui

As today's CMOS technology is scaling down to its physical limits, it suffers from major challenges such as increased leakage power and reduced reliability. Novel technologies, such as memristors, nanotube, and graphene transistors, are under research as alternatives. Among these technologies, memristor is a promising candidate due to its great scalability, high integration density and near-zero standby power. However, memristor-based logic circuits are facing robustness challenges mainly due to improper values of design parameters (e.g., OFF/ON ratio, control voltages). Moreover, process variation, sneak path currents and parasitic resistance of nanowires also impact the robustness. To realize a robust design, this paper formulates proper constraints for design parameters to guarantee correct functionality of logic gates (e.g., AND). Our proposal is verified with SPICE simulations while taking both device variation and parasitic effects into account. It is observed that the errors due to analytical parameter constraints are typically within 4.5% as compared to simulations. ...

Skeleton-based design and simulation flow for Computation-in-Memory architectures

Conference paper (2016) - Jintao Yu, Razvan Nane, Adib Haron, Said Hamdioui, H Corporaal, Koen Bertels

Memristor-based Computation-in-Memory is one of the emerging architectures proposed to deal with Big Data problems. The design of such architectures requires a radically new automatic design flow because the memristor is a passive device that uses resistance to encode its logic value. This paper proposes a design flow for mapping parallel algorithms on the CIM architecture. Algorithms with similar data flow graphs can be mapped on the crossbar using the same template containing scheduling, placement, and routing information; this template is named skeleton. By configuring such a skeleton with different pre-designed circuits, we can build CIM implementations of the corresponding algorithms in that class. This approach does not only map an algorithm on a memristor crossbar, but also gives an estimation of its performance, area, and energy consumption. It also supports user-defined constraints and parallel SystemC simulation. Experimental results demonstrate the feasibility and the potential of the approach. ...

Parallel Matrix Multiplication on Memristor-Based Computation-in-Memory Architecture

Conference paper (2016) - Adib Haron, Jintao Yu, Razvan Nane, Mottaqiallah Taouil, Said Hamdioui, Koen Bertels

One of the most important constraints of today’s architectures for data-intensive applications is the limited bandwidth due to the memory-processor communication bottleneck. This significantly impacts performance and energy. For instance, the energy consumption share of communication and memory
access may exceed 80%. Recently, the concept of Computation-in-Memory (CIM) was proposed, which is based on the integration of storage and computation in the same physical location using a crossbar topology and non-volatile resistive-switching memristor technology. To illustrate the tremendous potential of CIM architecture in exploiting massively parallel computation while reducing the communication overhead, we present a communicationefficient mapping of a large-scale matrix multiplication algorithm on the CIM architecture. The experimental results show that, depending on the matrix size, CIM architecture exhibits several orders of magnitude higher performance in total execution time
and two orders of magnitude better in total energy consumption than the multicore-based on the shared memory architecture. ...