T. Shahroodi

16 records found

Efficient Implementation of Binary Neural Network Based on Computation in Memory

Journal article (2024) - Mahdi Zahedi, Taha Shahroodi, Carlos Escuin, Georgi Gaydadjiev, Stephan Wong, Said Hamdioui
Applications of Binary Neural Networks (BNNs) are promising for embedded systems with hard constraints on energy and computing power. Contrary to conventional neural networks using floating-point datatypes, BNNs use binarized weights and activations to reduce memory and computation requirements. Memristors, emerging non-volatile memory devices, show great potential as a target implementation platform for BNNs by integrating storage and compute units. However, the efficiency of this hardware highly depends on how the network is mapped and executed on these devices. In this paper, we propose an efficient implementation of XNOR-based BNN to maximize parallelization. In this implementation, costly analog-to-digital converters are replaced with sense amplifiers with custom reference(s) to generate activation values. Besides, a novel mapping is introduced to minimize the overhead of data communication between convolution layers mapped to different memristor crossbars. This comes with extensive analytical and simulation-based analysis to evaluate the implication of different design choices considering the accuracy of the network. The results show that our approach achieves up to 5× energy-saving and 100× improvement in latency compared to baselines. ...
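The XNOR-based evaluation mentioned in this abstract can be illustrated in software. The following is only a minimal sketch, assuming the common convention that +1 is encoded as bit 1 and -1 as bit 0; the function names and the `reference` threshold are hypothetical and do not reproduce the paper's crossbar mapping or sense-amplifier circuit.

```python
def to_bits(signs):
    """Pack a list of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def xnor_matches(a_bits, w_bits, n):
    """Count positions where activation and weight bits agree (XNOR + popcount)."""
    mask = (1 << n) - 1
    return bin(~(a_bits ^ w_bits) & mask).count("1")


def xnor_dot(a_bits, w_bits, n):
    """Dot product of two {-1,+1} vectors of length n: 2 * matches - n."""
    return 2 * xnor_matches(a_bits, w_bits, n) - n


def binary_neuron(a_bits, w_bits, n, reference):
    """Binary activation: compare the match count against a reference value.

    This comparison is a software stand-in (an assumption) for thresholding
    against a custom reference instead of digitizing the full sum.
    """
    return 1 if xnor_matches(a_bits, w_bits, n) >= reference else 0


a = to_bits([1, -1, 1, 1])
w = to_bits([1, 1, -1, 1])
print(xnor_dot(a, w, 4), binary_neuron(a, w, 4, reference=2))
```

Because the activation only needs the outcome of a comparison, a single thresholding step is enough in this sketch; no multi-bit conversion of the sum is required.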
Conference paper (2024) - Taha Shahroodi, Raphael Cardoso, Stephan Wong, Alberto Bosio, Ian O'Connor, Said Hamdioui
State-of-the-Art (SotA) hardware implementations of Deep Neural Networks (DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are potential alternative solutions to realize faster implementations without losing accuracy. In this paper, we first present a new data mapping, called TacitMap, suited for BNNs implemented based on a Computation-In-Memory (CIM) architecture. TacitMap maximizes the use of available parallelism, while CIM architecture eliminates the data movement overhead. We then propose a hardware accelerator based on optical phase change memory (oPCM) called EinsteinBarrier. EinsteinBarrier incorporates TacitMap and adds an extra dimension for parallelism through wavelength division multiplexing, leading to extra latency reduction. The simulation results show that, compared to the SotA CIM baseline, TacitMap and EinsteinBarrier significantly improve execution time by up to ∼154× and ∼3113×, respectively, while also maintaining the energy consumption within 60% of that in the CIM baseline. ...
Journal article (2024) - Asif Ali Khan, Fazal Hameed, Taha Shahroodi, Alex K. Jones, Jeronimo Castrillon
DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed pre-alignment filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. However, the filtering operation itself is memory-intensive for which the conventional Von-Neumann architectures perform poorly. Therefore, recent designs advocate compute near memory (CNM) accelerators based on stacked DRAM and more exotic memory technologies such as racetrack memories (RTM). However, these designs only support small DNA reads of circa 100 nucleotides, referred to as short reads. This letter proposes a CNM system for handling both long and short reads. It introduces a novel data-placement solution that significantly increases parallelism and reduces overhead. Evaluation results show substantial reductions in execution time (1.32×) and energy consumption (50%), compared to the state-of-the-art. ...
Doctoral thesis (2024) - T. Shahroodi
Modern applications like Genomics and Machine Learning (ML) hold the potential to reshape our understanding of diseases’ genetic origins and guide machines in executing tasks and making predictions without our explicit programming. The successful, widespread integration of these modern applications can usher in advancements in diagnostics, individualized medicine, and routine tasks such as language interpretation, image analysis, and object categorization. However, our traditional computing infrastructures fall short when accommodating the distinct characteristics of these new applications. Specifically, (1) these applications handle an immense and ever-expanding data working set, and (2) each succeeding version of these applications and their associated use cases necessitates quicker and more energy-efficient analysis of these vast data sets. This is because our traditional computing systems largely hinge on (1) the von-Neumann architecture, a design that distinctly positions processing entities (like CPUs and GPUs) away from storage components (like memories and flash drives), and (2) the CMOS-based technology. While attempting to meet the performance and energy demands of our modern applications, these fully CMOS-based systems based on von-Neumann architecture have increasingly struggled and hit inherent roadblocks, with data movement overhead being the predominant issue. To alleviate the data movement bottleneck, contemporary research revisits a concept historically known as Computation-In-Memory (CIM) or, alternatively, Processing-In-Memory (PIM). At its core, CIM emphasizes positioning computational capabilities close to, or within, the memory units storing the data. This placement might be within memory chips, in memory controllers, amid caches, or embedded in the logic layers of 3D-stacked memories.
As a computational model, architectures leveraging CIM (referred to as CIM architectures) stand to tackle the issue of data movement overhead inherent in the von-Neumann architecture by diminishing or outright eradicating the data movement between computational locales and data storage areas. Moreover, from a technological perspective, emerging memory technologies, including memristive devices and circuits, show potential to replace traditional memory systems, addressing some of the challenges posed by CMOS-based designs. Irrespective of the specific CIM architecture deployed to optimize performance or energy efficiency in modern applications, there are substantial practical challenges to address and ponder upon first. Both system designers and developers face these hurdles and design decisions, which are critical to surmount for CIM’s widespread acceptance across various computational areas and application domains. In this dissertation, our focus is twofold: (1) We delve into the acceleration and streamlined execution of various steps in two pivotal application realms: genomics and ML; and (2) We explore several emerging memory technologies alongside circuit and architectural strategies that show promise in enhancing CIM designs, specifically tailored for modern applications. Therefore, in this thesis, we identify and propose strategies and designs to ameliorate the constrained performance of key kernels in genomics and ML. Recognizing that applications within these realms consist of diverse functions or kernels, it is imperative for a designer to possess a thorough understanding of them. Each function/kernel can be characterized by distinct data and control flows, calling for varied features to be enabled in either a von-Neumann or a CIM architecture.
To enhance the efficacy of each function/kernel, we first profile them individually and then within a larger context of their corresponding pipeline, followed by discerning the best avenues for their memory mapping in a CIM architecture. We then undertake a concurrent assessment of essential adjunct components alongside the memory array, commonly referred to as the peripheries. For a designer, proficiency in the applications executable on a CIM system leveraging emerging memory technologies is indispensable. Grasping the fundamental characteristics of CIM and having an overarching view of its scope becomes vital prior to its integration. We aim to aggregate critical application features, improvement opportunities, and design decisions and refine them to their core essence. Through this, we aspire to shed light on present design options and identify kernels demanding heightened attention. Such insights can be instrumental in revealing prospective directions, encompassing supported kernels along with their respective merits and trade-offs. We exploit emerging technologies and architect state-of-the-art CIM designs that optimally serve the targeted kernels, keeping a holistic improvement perspective at the forefront. Delving into emerging (memory) technologies, such as memristive devices like PCM and STT-MRAM, is crucial. These devices provide a suite of advantages, including non-volatility, compactness, and a natural aptitude for conducting logical operations (for instance, the logical AND). Additionally, other emerging technologies, such as integrated photonics, have the potential to enhance the CIM paradigm further with their capacity for high-frequency and low-latency functions. Our ambition is to integrate multiple such technologies, harnessing their distinct attributes, to craft a CIM design that surpasses the SotA counterparts across key benchmarks, be it in execution speed or energy. 
This thesis demonstrates that when CIM is fused with emerging (memory) technologies, there is a marked enhancement in the performance of several Genomics pipelines and Machine Learning applications. It is our aspiration and conviction that the evaluations, methodologies, and findings detailed in this dissertation will empower the broader community to comprehend and address contemporary and upcoming challenges that revolve around enhancing the performance and energy efficiency of modern applications through the integration of (re)emerging computing paradigms and technologies. Additionally, our work provides insights for adapting these technologies to novel applications, ensuring they deliver optimal benefits. ...
Conference paper (2024) - Asmae El Arrassi, Mohammad Amin Yaldagard, Xingjian Tao, Taha Shahroodi, Fouwad Mir, Yashvardhan Biyani, Manil Dev Gomony, Anteneh Gebregiorgis, Rajiv Joshi, Said Hamdioui
Binary Neural Networks (BNNs) have demonstrated significant advantages in reducing computation and memory costs, all while maintaining acceptable accuracy on various image detection tasks. Thus, BNNs have the potential to support practical cognitive tasks on resource-constrained platforms, such as edge computing devices. To realize this, SRAM-based digital Computation-in-Memory (CIM) has gained growing attention as it overcomes the analog CIM architecture bottlenecks such as limited computing accuracy due to process variation, non-linearity, and power- and area-hungry Analog-to-Digital Converters (ADCs). However, digital CIM architectures are highly dominated by power-hungry adder trees, which can nullify the benefits of SRAM-based digital CIM. To address this issue, this paper proposes an adder-free SRAM-based digital CIM, AFSRAM-CIM, for BNN acceleration. The proposed CIM architecture utilizes a multi-functional 10-T SRAM cell-based crossbar array and a new energy-efficient approach to perform the popcount operation. Simulation results using the MNIST dataset show that the proposed architecture maintains the state-of-the-art inference accuracy of 99.21% with only 11.86 fJ energy per operation. Moreover, AFSRAM-CIM achieves over 3× energy and ≈17× area savings when compared to the conventional digital CIM approaches. ...

Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis

Journal article (2024) - Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, More authors...
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by (1) designing flexible hardware to accommodate various pHMM designs, (2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, (3) rapidly filtering out unnecessary computations using a hardware-based filter, and (4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. 
ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: (1) error correction, (2) protein family search, and (3) multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×. ...
Conference paper (2023) - Taha Shahroodi, Rafaela Cardoso, Mahdi Zahedi, Stephan Wong, Alberto Bosio, Ian O'Connor, Said Hamdioui
This paper investigates the potential of a compute-in-memory core based on optical Phase Change Materials (oPCMs) to speed up and reduce the energy consumption of the Matrix-Matrix-Multiplication operation. The paper also proposes a new data mapping for Binary Neural Networks (BNNs) tailored for our oPCM core. The preliminary results show a significant latency improvement irrespective of the evaluated network structure and size. The improvement varies from network to network and goes up to ~1053x. ...
The vast potential of memristor-based computation-in-memory (CIM) engines has mainly triggered the mapping of best-suited applications. Nevertheless, with additional support, existing applications can also benefit from CIM. In particular, this paper proposes an energy and area-efficient CIM-based methodology to perform arithmetic signed matrix multiplications. Our approach combines a) the mapping of the signed operands on the 1T1R crossbar, and b) the augmentation of the periphery with customized circuits to support the execution of shift and accumulate needed for the arithmetic operations. The operand mapping is performed without the need for sign extension; hence, reducing the required memory size. To demonstrate the superiority of our scheme as compared with the state-of-the-art, simulations are performed for different case studies including a neural network and two kernels which are taken from the Polybench/C benchmark suite. The results show that our approach achieves up to 8× energy-saving and 3× area-saving compared with other CIM-based prior works. ...
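The shift-and-accumulate step referred to in this abstract can be sketched for signed operands. The example below is an illustrative assumption rather than the paper's circuit: it treats each weight as an n-bit two's-complement number, processes one bit plane at a time (as a stand-in for one crossbar read), and gives the most significant plane a negative weight so that no sign extension is needed. The name `signed_dot` and its parameters are hypothetical.

```python
def signed_dot(inputs, weights, n_bits):
    """Dot product using bit planes of n-bit two's-complement weights.

    For each bit position i, the inner sum plays the role of a column sum
    over the stored bit plane; the result is then shifted by i and
    accumulated. The MSB plane is subtracted because it carries weight
    -2**(n_bits - 1) in two's complement.
    """
    acc = 0
    for i in range(n_bits):
        # Binary "column sum" for bit plane i of all weights.
        partial = sum(x * ((w >> i) & 1) for x, w in zip(inputs, weights))
        scale = 1 << i
        if i == n_bits - 1:
            acc -= partial * scale  # sign plane
        else:
            acc += partial * scale  # magnitude planes
    return acc


def signed_matvec(matrix, vector, n_bits):
    """Apply signed_dot row by row; each row holds n-bit signed weights."""
    return [signed_dot(vector, row, n_bits) for row in matrix]


print(signed_matvec([[-3, 4, -1], [7, -8, 0]], [3, 2, 5], n_bits=4))
```

Each weight must fit in `n_bits` two's-complement bits (here, between -8 and 7) for the result to match an ordinary dot product.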

Energy-efficient Design for In-memory Sparse-based Graph Processing

Performing analysis on large graph datasets in an energy-efficient manner has posed a significant challenge; not only due to excessive data movements and poor locality, but also due to the non-optimal use of high sparsity of such datasets. The latter leads to a waste of resources as the computation is also performed on zero operands which do not contribute to the final result. This paper designs a novel graph processing accelerator, SparseMEM, targeting sparse datasets by leveraging the computing-in-memory (CIM) concept; CIM is a promising solution to alleviate the overhead of data movement and the inherent poor locality of graph processing. The proposed solution stores the graph information in a compressed hierarchical format inside the memory and adjusts the workflow based on this new mapping. This vastly improves resource utilization, leading to higher energy and performance efficiency. The experimental results demonstrate that SparseMEM outperforms a GPU-based platform and two state-of-the-art in-memory accelerators on speedup and energy efficiency by one and three orders of magnitude, respectively. ...
The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to reduce the execution time of short-read sequence alignment greatly. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, the filtering itself now constitutes the larger portion of the execution time. A significant contributing factor entails the movement of sequences from the memory to the processing units, while a majority will filter out as they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future pre-alignment filtering algorithms using the same operations and underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to support shared kernels of pre-alignment filters and algorithms inside the memory (i.e., preventing data movements). SieveMem architecture also provides support for future algorithms. SieveMem supports more than 47.6% of shared operations among all top 5 SotA filters. Moreover, SieveMem includes a hardware-friendly pre-alignment filtering algorithm called BandedKrait, inspired by a combination of mentioned kernels. Our evaluations show that SieveMem provides up to 331.1× and 446.8× improvement in the execution time of the two most-common kernels. Our evaluations also show that BandedKrait provides accuracy at the SotA level. Using BandedKrait on SieveMem, a design we call Mem-BandedKrait, one can improve the execution time of end-to-end sequence alignment irrespective of the dataset, which can go up to 91.4× compared to the SotA accelerator on GPU. ...

Enabling Massively Parallel Computation in DRAM via Lookup Tables

Conference paper (2022) - Joao Dinis Ferreira, Gabriel Falcao, Juan Gomez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S. Kim, Geraldo F. Oliveira, Taha Shahroodi, More authors...
Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713× and 1.2×, respectively, while simultaneously reducing energy consumption by an average of 1855× and 39.5×. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3×. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code and all scripts required to reproduce the results of this paper are openly and fully available at https://github.com/CMU-SAFARI/pLUTo. ...
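The key idea stated in this abstract, replacing a complex operation with bulk table reads, can be shown with a small software analogy. This is only a sketch under assumptions: the 4-bit operand width, the packing of two operands into one 8-bit index, and all function names are hypothetical and are not taken from the pLUTo code base.

```python
def build_lut(func, input_bits):
    """Precompute func for every possible input of the given bit width."""
    return [func(x) for x in range(1 << input_bits)]


def mul4(packed):
    """Multiply the two 4-bit operands packed into one 8-bit index."""
    a, b = packed >> 4, packed & 0xF
    return a * b


# 256-entry table holding every 4-bit x 4-bit product.
MUL_LUT = build_lut(mul4, 8)


def bulk_query(lut, indices):
    """Answer many queries with plain reads; no arithmetic at query time."""
    return [lut[i] for i in indices]


def lut_multiply(xs, ys):
    """Element-wise multiplication of 4-bit values using only table reads."""
    return bulk_query(MUL_LUT, [(x << 4) | y for x, y in zip(xs, ys)])


print(lut_multiply([3, 15, 0], [5, 15, 9]))
```

The trade-off in this sketch is table size against logic: the table grows exponentially with the operand width, which is why dense storage matters for this style of computation.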
Conference paper (2022) - Abhairaj Singh, Mahdi Zahedi, Taha Shahroodi, Mohit Gupta, Anteneh Gebregiorgis, Manu Komalan, Rajiv V. Joshi, Francky Catthoor, Rajendra Bishnoi, Said Hamdioui
Spin-transfer torque magnetic random access memory (STT-MRAM) based computation-in-memory (CIM) architectures have shown great prospects for energy-efficient computing. However, device variations and non-idealities narrow down the sensing margin, which severely impacts the computing accuracy. In this work, we propose an adaptive referencing mechanism to improve the sensing margin of a CIM architecture for Boolean binary logic (BBL) operations. We generate reference signals using multiple STT-MRAM devices and place them strategically into the array such that these signals can address the variations and trace the wire parasitics effectively. We have demonstrated this behavior using an STT-MRAM model, which is calibrated using a 1Mbit characterized array. Results show that our proposed architecture for binary neural networks (BNN) achieves up to 17.8 TOPS/W on the MNIST dataset and 130× performance improvement for the text encryption compared to the software implementation on an Intel Haswell processor. ...

A Memristor-Augmented HW/SW Framework for Taxonomic Profiling

State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead. ...

From Primitive to Complex Functions

In recent years, we are witnessing a trend moving away from conventional computer architectures towards Computation-In-Memory (CIM) based on emerging memristor devices. This is due to the fact that the performance and energy efficiency of traditional computer architectures can no longer be increased at the same pace as before. The main barriers which limit the performance and energy improvement are the memory and power walls. Thus far, the main effort from researchers is toward enabling CIM as an accelerator for specific applications. Consequently, this current application-specific nature/approach has put less emphasis on the potential general-purpose applicability of CIM, i.e., merging several accelerators into one that is less than the sum of the parts. In this paper, we demonstrate the CIM concept using a broader and generalized model. Considering this model, the state-of-the-art CIM-based logic and arithmetic primitive functions, which can be the building blocks for complex functions, are investigated. Besides, we present potential applications of CIM which provides insights into the challenges and opportunities of a generic CIM system design. Finally, we highlight the future directions regarding the construction of CIM-based systems. ...

A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory

Journal article (2022) - Taha Shahroodi, Mahdi Zahedi, Can Firtina, Mohammed Alser, Stephan Wong, Onur Mutlu, Said Hamdioui
Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential frauds in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art profilers are unfortunately too costly for food profiling. Our goal is to design a food profiler that solves the main limitations of existing profilers, namely (1) working on massive data structures and (2) incurring considerable data movement, for a real-time monitoring system. To this end, we propose Demeter, the first platform-independent framework for food profiling. Demeter overcomes the first limitation through the use of hyperdimensional computing (HDC) and efficiently performs the accurate few-species classification required in food profiling. We overcome the second limitation by the use of an in-memory hardware accelerator for Demeter (named Acc-Demeter) based on memristor devices. Acc-Demeter actualizes several domain-specific optimizations and exploits the inherent characteristics of memristors to improve the overall performance and energy consumption of Acc-Demeter. We compare Demeter’s accuracy with other industrial food profilers using detailed software modeling. We synthesize Acc-Demeter’s required hardware using UMC’s 65nm library by considering an accurate PCM model based on silicon-based prototypes. Our evaluations demonstrate that Acc-Demeter achieves a (1) throughput improvement of 192× and 724× and (2) memory reduction of 36× and 33× compared to Kraken2 and MetaCache (2 state-of-the-art profilers), respectively, on typical food-related databases. Demeter maintains an acceptable profiling accuracy (within 2% of existing tools) and incurs a very low area overhead. ...

A customizable hardware prefetching framework using online reinforcement learning

Conference paper (2021) - Rahul Bera, Konstantinos Kanellopoulos, Anant V. Nori, Taha Shahroodi, Sreenivas Subramoney, Onur Mutlu
Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm. We show that prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and system-aware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads.
The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia. ...
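The observe, decide, and reward loop described in this abstract follows the general shape of tabular reinforcement learning. The sketch below is a generic illustration under assumptions, not Pythia's design: the class name, the state encoding, the candidate prefetch offsets, the reward value, and the hyperparameters are all hypothetical, and the update rule is a textbook SARSA-style step.

```python
import random
from collections import defaultdict


class TabularPrefetchAgent:
    """Toy agent mapping a program-context state to a prefetch offset."""

    def __init__(self, actions, alpha=0.1, gamma=0.5, epsilon=0.0, seed=0):
        self.actions = list(actions)      # candidate prefetch offsets
        self.alpha = alpha                # learning rate
        self.gamma = gamma                # discount factor
        self.epsilon = epsilon            # exploration probability
        self.q = defaultdict(float)       # Q-values keyed by (state, action)
        self.rng = random.Random(seed)

    def choose(self, state):
        """Pick an offset: explore with probability epsilon, else be greedy."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_action):
        """SARSA-style update that reinforces rewarded state-action pairs."""
        target = reward + self.gamma * self.q[(next_state, next_action)]
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


# Hypothetical state: (program counter, last address delta).
agent = TabularPrefetchAgent(actions=[0, 1, 2])
state = (0x400123, 1)
agent.update(state, action=1, reward=1.0, next_state=state, next_action=1)
print(agent.choose(state))
```

In a bandwidth-aware setting, the reward passed to `update` would be lowered for prefetches issued while memory bandwidth usage is high; how such a reward is defined is left open here.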