Circular Image

M.Z. Zahedi

info

Please Note

15 records found

Efficient Implementation of Binary Neural Network Based on Computation in Memory

Journal article (2024) - Mahdi Zahedi, Taha Shahroodi, Carlos Escuin, Georgi Gaydadjiev, Stephan Wong, Said Hamdioui
Applications of Binary Neural Networks (BNNs) are promising for embedded systems with hard constraints on energy and computing power. Contrary to conventional neural networks using floating-point datatypes, BNNs use binarized weights and activations to reduce memory and computation requirements. Memristors, emerging non-volatile memory devices, show great potential as a target implementation platform for BNNs by integrating storage and compute units. However, the efficiency of this hardware highly depends on how the network is mapped and executed on these devices. In this paper, we propose an efficient implementation of XNOR-based BNN to maximize parallelization. In this implementation, costly analog-to-digital converters are replaced with sense amplifiers with custom reference(s) to generate activation values. Besides, a novel mapping is introduced to minimize the overhead of data communication between convolution layers mapped to different memristor crossbars. This comes with extensive analytical and simulation-based analysis to evaluate the implication of different design choices considering the accuracy of the network. The results show that our approach achieves up to 5× energy-saving and 100× improvement in latency compared to baselines. ...
The vast potential of memristor-based computation-in-memory (CIM) engines has mainly triggered the mapping of best-suited applications. Nevertheless, with additional support, existing applications can also benefit from CIM. In particular, this paper proposes an energy and area-efficient CIM-based methodology to perform arithmetic signed matrix multiplications. Our approach combines a) the mapping of the signed operands on the 1T1R crossbar, and b) the augmentation of the periphery with customized circuits to support the execution of shift and accumulate needed for the arithmetic operations. The operand mapping is performed without the need for sign extension; hence, reducing the required memory size. To demonstrate the superiority of our scheme as compared with the state-of-the-art, simulations are performed for different case studies including a neural network and two kernels which are taken from the Polybench/C benchmark suite. The results show that our approach achieves up to 8× energy-saving and 3× area-saving compared with other CIM-based prior works. ...
The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to reduce the execution time of short-read sequence alignment greatly. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, the filtering itself now constitutes the larger portion of the execution time. A significant contributing factor entails the movement of sequences from the memory to the processing units, while a majority will filter out as they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future pre-alignment filtering algorithms using the same operations and underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to support shared kernels of pre-alignment filters and algorithms inside the memory (i.e., preventing data movements). SieveMem architecture also provides support for future algorithms. SieveMem supports more than 47.6% of shared operations among all top 5 SotA filters. Moreover, SieveMem includes a hardware-friendly pre-alignment filtering algorithm called BandedKrait, inspired by a combination of mentioned kernels. Our evaluations show that SieveMem provides up to 331.1 x and 446.8 × improvement in the execution time of the two most-common kernels. Our evaluations also show that BandedKrait provides accuracy at the SotA level. Using BandedKrait on SieveMem, a design we call Mem-BandedKrait, one can improve the execution time of end-to-end sequence alignment irrespective of the dataset, which can go up to 91.4 × compared to the SotA accelerator on GPU. ...
Doctoral thesis (2023) - M.Z. Zahedi, S. Hamdioui, J.S.S.M. Wong
Computation-in-Memory (CIM) is a promising alternative to traditional computing systems where the storage is conceptually separated fromthe computing units. Instead, the CIM paradigm aims to perform the computation where the data resides, alleviating the memory bottleneck and ultimately leading to higher energy efficiency and performance. From the memory technology perspective, memristors, emerging non-volatile memory devices, demonstrate various beneficial characteristics. Although the concept of CIM, in combination with these emerging memory technologies, is in the infancy stage, it shows great potential as a future of computing systems. To further understand and quantify the potential of CIM, more development is required at each abstraction level. In this thesis, we first explore the main potentials for memristor-based computation-in-memory. Then, we study different applications from the CIM perspective to understand different behaviors and patterns of applications and use this knowledge to develop architectural solutions for CIM. Based on that, we study the realization of CIM as a generic and flexible platformat amicro-architecture level. ...
Conference paper (2023) - Carlos Escuin, Fernando García-Redondo, Francky Catthoor, Mahdi Zahedi, Pablo Ibáñez, Teresa Monreal, Victor Viñals, José María Llabería, James Myers, Julien Ryckaert, Dwaipayan Biswas
This paper optimizes the MNEMOSENE architecture, a compute-in-memory (CiM) tile design integrating computation and storage for increased efficiency. We identify and address bottlenecks in the Row Data (RD) buffer that cause losses in performance. Our proposed approach includes mitigating these buffering bottlenecks and extending MNEMOSENE’s single-tile design to a multi-tile configuration for improved parallel processing. The proposal is validated through comprehensive analyses exploring the mapping of diverse neural networks evaluated on CiM crossbar arrays based on NVM technologies. These proposed enhancements lead up to 55% reduction in execution time compared to the original single-tile architecture for any general matrix multiplication (GEMM) operation. Our evaluation shows that while ReRAM and PCM offer notable energy advantages, their integration with scaled CMOS is limited, which leads to VGSOT-MRAM emerging as a promising alternative due to its good balance between energy efficiency and superior integration capabilities. The VGSOT-MRAM crossbar arrays provide 12×,49×, and 346× more energy efficiency than PCM, ReRAM, and STT-MRAM ones, respectively. It translates, on average for the considered workload, in 1.5×,3×, and 14.5× better energy efficiency of the entire system. ...

Energy-efficient Design for In-memory Sparse-based Graph Processing

Performing analysis on large graph datasets in an energy-efficient manner has posed a significant challenge; not only due to excessive data movements and poor locality, but also due to the non-optimal use of high sparsity of such datasets. The latter leads to a waste of resources as the computation is also performed on zero's operands which do not contribute to the final result. This paper designs a novel graph processing accelerator, SparseMEM, targeting sparse datasets by leveraging the computing-in-memory (CIM) concept; CIM is a promising solution to alleviate the overhead of data movement and the inherent poor locality of graph processing. The proposed solution stores the graph information in a compressed hierarchical format inside the memory and adjusts the workflow based on this new mapping. This vastly improves resource utilization, leading to higher energy and permanence efficiency. The experimental results demonstrate that SparseMEM outperforms a GPU-based platform and two state-of-the-art in-memory accelerators on speedup and energy efficiency by one and three orders of magnitude, respectively. ...
Conference paper (2023) - Taha Shahroodi, Rafaela Cardoso, Mahdi Zahedi, Stephan Wong, Alberto Bosio, Ian O'Connor, Said Hamdioui
This paper investigates the potential of a compute-in-memory core based on optical Phase Change Materials (oPCMs) to speed up and reduce the energy consumption of the Matrix-Matrix-Multiplication operation. The paper also proposes a new data mapping for Binary Neural Networks (BNNs) tailored for our oPCM core. The preliminary results show a significant latency improvement irrespective of the evaluated network structure and size. The improvement varies from network to network and goes up to ~1053x. ...

A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory

Journal article (2022) - Taha Shahroodi, Mahdi Zahedi, Can Firtina, Mohammed Alser, Stephan Wong, Onur Mutlu, Said Hamdioui
Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential frauds in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art profilers are unfortunately too costly for food profiling. Our goal is to design a food profiler that solves the main limitations of existing profilers, namely (1) working on massive data structures and (2) incurring considerable data movement, for a real-time monitoring system. To this end, we propose Demeter, the first platform-independent framework for food profiling. Demeter overcomes the first limitation through the use of hyperdimensional computing (HDC) and efficiently performs the accurate few-species classification required in food profiling. We overcome the second limitation by the use of an in-memory hardware accelerator for Demeter (named Acc-Demeter) based on memristor devices. Acc-Demeter actualizes several domain-specific optimizations and exploits the inherent characteristics of memristors to improve the overall performance and energy consumption of Acc-Demeter. We compare Demeter’s accuracy with other industrial food profilers using detailed software modeling. We synthesize Acc-Demeter’s required hardware using UMC’s 65nm library by considering an accurate PCM model based on silicon-based prototypes. Our evaluations demonstrate that Acc-Demeter achieves a (1) throughput improvement of 192× and 724× and (2) memory reduction of 36× and 33× compared to Kraken2 and MetaCache (2 state-of-the-art profilers), respectively, on typical food-related databases. Demeter maintains an acceptable profiling accuracy (within 2% of existing tools) and incurs a very low area overhead. ...

From Primitive to Complex Functions

In recent years, we are witnessing a trend moving away from conventional computer architectures towards Computation-In-Memory (CIM) based on emerging memristor devices. This is due to the fact that the performance and energy efficiency of traditional computer architectures can no longer be increased at the same pace as before. The main barriers which limit the performance and energy improvement are the memory and power walls. Thus far, the main effort from researchers is toward enabling CIM as an accelerator for specific applications. Consequently, this current application-specific nature/approach has put less emphasis on the potential general-purpose applicability of CIM, i.e., merging several accelerators into one that is less than the sum of the parts. In this paper, we demonstrate the CIM concept using a broader and generalized model. Considering this model, the state-of-the-art CIM-based logic and arithmetic primitive functions, which can be the building blocks for complex functions, are investigated. Besides, we present potential applications of CIM which provides insights into the challenges and opportunities of a generic CIM system design. Finally, we highlight the future directions regarding the construction of CIM-based systems. ...
The massive deployment of Internet of Things (IoT) devices makes them vulnerable against physical tampering attacks, such as fault injection. These kind of hardware attacks are very popular as they typically do not require complex equipment or high expertise. Hence, it is important that IoT devices are protected against them. In this work, we present a novel fault injection attack detector with high flexibility and low overhead. Our solution is based on the reuse of a security primitive used in many IoT devices, i.e., ring oscillator (RO) physically unclonable function (PUF). Our results show that we obtain a high detection effectiveness and no false alarms against most popular fault injection attacks based on voltage and clock manipulations. ...

Tile Architecture and Simulator for Memristor-based Computation-in-memory

Journal article (2022) - Mahdi Zahedi, Muah Abu Lebdeh, Christopher Bengel, Dirk Wouters, Stephan Menzel, Manuel Le Gallo, Abu Sebastian, Stephan Wong, Said Hamdioui
In recent years, we are witnessing a trend toward in-memory computing for future generations of computers that differs from traditional von-Neumann architecture in which there is a clear distinction between computing and memory units. Considering that data movements between the central processing unit (CPU) and memory consume several orders of magnitude more energy compared to simple arithmetic operations in the CPU, in-memory computing will lead to huge energy savings as data no longer needs to be moved around between these units. In an initial step toward this goal, new non-volatile memory technologies, e.g., resistive RAM (ReRAM) and phase-change memory (PCM), are being explored. This has led to a large body of research that mainly focuses on the design of the memory array and its peripheral circuitry. In this article, we mainly focus on the tile architecture (comprising a memory array and peripheral circuitry) in which storage and compute operations are performed in the (analog) memory array and the results are produced in the (digital) periphery. Such an architecture is termed compute-in-memory-periphery (CIM-P). More precisely, we derive an abstract CIM-tile architecture and define its main building blocks. To bridge the gap between higher-level programming languages and the underlying (analog) circuit designs, an instruction-set architecture is defined that is intended to control and, in turn, sequence the operations within this CIM tile to perform higher-level more complex operations. Moreover, we define a procedure to pipeline the CIM-tile operations to further improve the performance. To simulate the tile and perform design space exploration considering different technologies and parameters, we introduce the fully parameterized first-of-its-kind CIM tile simulator and compiler. Furthermore, the compiler is technology-aware when scheduling the CIM-tile instructions. Finally, using the simulator, we perform several preliminary design space explorations regarding the three competing technologies, ReRAM, PCM, and STT-MRAM concerning CIM-tile parameters, e.g., the number of ADCs. Additionally, we investigate the effect of pipelining in relation to the clock speeds of the digital periphery assuming the three technologies. In the end, we demonstrate that our simulator is also capable of reporting energy consumption for each building block within the CIM tile after the execution of in-memory kernels considering the data-dependency on the energy consumption of the memory array. All the source codes are publicly available. ...

A Memristor-Augmented HW/SW Framework for Taxonomic Profiling

State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead. ...
Conference paper (2022) - Abhairaj Singh, Mahdi Zahedi, Taha Shahroodi, Mohit Gupta, Anteneh Gebregiorgis, Manu Komalan, Rajiv V. Joshi, Francky Catthoor, Rajendra Bishnoi, Said Hamdioui
Spin-transfer torque magnetic random access memory (STT-MRAM) based computation-in-memory (CIM) architectures have shown great prospects for an energy-efficient computing. However, device variations and non-idealities narrow down the sensing margin that severely impacts the computing accuracy. In this work, we propose an adaptive referencing mechanism to improve the sensing margin of a CIM architecture for boolean binary logic (BBL) operations. We generate reference signals using multiple STT-MRAM devices and place them strategically into the array such that these signals can address the variations and trace the wire parasitics effectively. We have demonstrated this behavior using an STT-MRAM model, which is calibrated using 1Mbit characterized array. Results show that our proposed architecture for binary neural networks (BNN) achieves up to 17.8 TOPS/W on the MNIST dataset and 130× performance improvement for the text encryption compared to the software implementation on Intel Haswell processor. ...
Computation-in-memory (CIM) shows great promise for specific applications by employing emerging (non-volatile) memory technologies such as memristors for both storage and compute, greatly reducing energy consumption, and improving performance. Based on our own observations, we can clearly perceive the contours of a generic approach encompassing the use of a memristor array – using technologies such as PCM and ReRAM. In this paper, we present a new instruction-set architecture (ISA) to control a single CIM-tile that comprises the analog memory array itself and all necessary analog and digital periphery. The newly introduced ISA provides the following advantages: (1) flexibility in programming new CIM functionalities by simply rescheduling the instructions from the ISA, (2) definition of a simulation framework, (3) a hardware implementation of the digital periphery, and (4) a design-space exploration of specific CIM-tile operations targeting the aforementioned technologies. For (1), we defined our own compiler that can translate CIM-tile operations to a sequence of instructions from our ISA. The implementation of the digital periphery is synthesized with the 15 nm Nangate library and results regarding power/energy and area are presented. Finally, the design-space exploration is made possible by using the technology-specific parameters with values that have been verified by accurate technology models. All codes of the compiler and simulator as well as the HDL code of the digital periphery are publicly available. ...
Von Neumann-based architectures suffer from costly communication between CPU and memory. This communication imposes several orders of magnitude more power and performance overheads compared to the arithmetic operations performed by the processor. This overhead becomes critical for applications that require processing a large amount of data. Computation-in-Memory (CIM) leveraging memristor devices in the crossbar structure offers a potential solution to tackle this challenge. However, support for the integer data type is lacking in CIM approaches as most solutions operate on a single/few bits only. This paper proposes a new organization of the periphery (next to memristor crossbar) to compute matrix-matrix multiplication (MMM) at the tile level. More precisely, the analog additions performed in the crossbar is complemented with additions performed in the digital periphery. In this mixed analog-digital system, digital additions are performed in a way that only the minimum size of adders are required-this is to reduce the latency of the digital periphery as much as possible. In addition, the design is customized to the number of ADCs as well as datatype sizes to support different possible scenarios. The results show that our organization reduces energy and latency up to 50x and 3x, respectively, compared to the reference design. ...