T. Shahroodi | TU Delft Repository

Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory

Journal article (2024) - Asif Ali Khan (author) , Fazal Hameed (author) , Taha Shahroodi (author) , Alex E. Jones (author) , Jeronimo Castrillon (author)

DNA sequence alignment is a fundamental and computationally expensive operation in bioinformatics. Researchers have developed pre-alignment filters that effectively reduce the amount of data consumed by the alignment process by discarding locations that result in a poor match. Ho ...

ApHMM

Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis

Journal article (2024) - Can Firtina (author) , Kamlesh Pillai (author) , Gurpreet S. Kalsi (author) , Bharathwaj Suresh (author) , Damla Senol Cali (author) , Jeremie S. Kim (author) , Taha Shahroodi (author) , Meryem Banu Cavlak (author) , Joël Lindegger (author) , G.B. More Authors (author)

Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modific ...

Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies in the Baum-Welch algorithm by (1) designing flexible hardware to accommodate various pHMM designs, (2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, (3) rapidly filtering out unnecessary computations using a hardware-based filter, and (4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: (1) error correction, (2) protein family search, and (3) multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×.

AFSRAM-CIM: Adder Free SRAM-Based Digital Computation-in-Memory for BNN

Conference paper (2024) - Asmae El Arrassi (author) , M.A. Yaldagard (author) , Xingjian Tao (author) , Taha Shahroodi (author) , F.J. Mir (author) , Yashvardhan Biyani (author) , Manil Dev Gomony (author) , A.B. Gebregiorgis (author) , Rajiv Joshi (author) , S Hamdioui (author)

Binary Neural Networks (BNNs) have demonstrated significant advantages in reducing computation and memory costs, all while maintaining acceptable accuracy on various image detection tasks. Thus, BNNs have the potential to support practical cognitive tasks on resource-constrained ...

Computation-in-Memory for Modern Applications using Emerging Technologies

Doctoral thesis (2024) - Taha Shahroodi (author)

Modern applications like Genomics and Machine Learning (ML) hold the potential to reshape our understanding of diseases’ genetic origins and guide machines in executing tasks and making predictions without our explicit programming. The successful, widespread integration of these ...

Modern applications like Genomics and Machine Learning (ML) hold the potential to reshape our understanding of diseases’ genetic origins and guide machines in executing tasks and making predictions without our explicit programming. The successful, widespread integration of these modern applications can usher in advancements in di-agnostics, individualized medicine, and routine tasks such as language interpretation, image analysis, and object categorization. However, our traditional computing infrastructures fall short when accommodating the distinct characteristics of these new applications. Specifically, (1) these applications handle an immense and ever-expanding data working set, and (2) each succeeding version of these applications and their associated use cases necessitates quicker and more energy-efficient analysis of these vast data sets. This is because our traditional computing systems largely hinge on (1) the von-Neumann architecture, a design that distinctly positions processing entities (like CPUs and GPUs) away from storage components (like memories and flash drives), and (2) the CMOS-based technology. While attempting to meet the performance and energy demands of our modern applications, these fully CMOS-based systems based on von-Neumann architecture have increasingly struggled and hit inherent roadblocks, with data movement overhead being the predominant issue. To alleviate the data movement bottleneck, contemporary research revisits a concept historically known as Computation-In-Memory (CIM) or, alternatively, Processing-In-Memory (PIM). At its core, CIM emphasizes positioning computational capabilities close to, or within, the memory units storing the data. This placement might be within memory chips, in memory controllers, amid caches, or embedded in the logic layers of 3D-stacked memories. As a computational model, architectures leveraging CIM (referred to as CIM architectures) stand to tackle the issue of data movement overhead inherent in the von-Neumann architecture by diminishing or outright eradicating the data movement between computational locales and data storage areas. Moreover, from a techno-logical perspective, emerging memory technologies, including memristive devices and circuits, show potential to replace traditional memory systems, addressing some of the challenges posed by CMOS-based designs. Irrespective of the specific CIM architecture deployed to optimize performance or energy efficiency in modern applications, there are substantial practical challenges to address and ponder upon first. Both system designers and developers face these hurdles and design decisions, which are critical to surmount CIM’s widespread acceptance across various computational areas and application domains. In this dissertation, our focus is twofold: (1) We delve into the acceleration and streamlined execution of various steps in two pivotal application realms: genomics and ML; and (2) We explore several emerging memory technologies alongside circuit and architectural strategies, that show promise in enhancing CIM designs, specifically tailored for modern applications. Therefore, in this thesis, we identify and propose strategies and designs to ameliorate the constrained performance of key kernels in genomics and ML. Recognizing that applications within these realms consist of diverse functions or kernels, it is imperative for a designer to possess a thorough understanding of them. Each function/kernel can be characterized by distinct data and control flows, calling for varied features to be enabled in either a von-Neumann or a CIM architecture. To enhance the efficacy of each function/kernel, we first profile them individually and then within a larger context of their corresponding pipeline, followed by discerning the best avenues for their memory mapping in a CIM architecture. We then undertake a concurrent assessment of essential adjunct components alongside the memory array, commonly referred to as the peripheries. For a designer, proficiency in the applications executable on a CIM system leveraging emerging memory technologies is indispensable. Grasping the fundamental characteristics of CIM and having an overarching view of its scope becomes vital prior to its integration. We aim to aggregate critical application features, improvement opportunities, and design decisions and refine them to their core essence. Through this, we aspire to shed light on present design options and identify kernels demanding heightened attention. Such insights can be instrumental in revealing prospective directions, encompassing supported kernels along with their respective merits and trade-offs. We exploit emerging technologies and architect state-of-the-art CIM designs that optimally serve the targeted kernels, keeping a holistic improvement perspective at the forefront. Delving into emerging (memory) technologies, such as memristive devices like PCM and STT-MRAM, is crucial. These devices provide a suite of advantages, including non-volatility, compactness, and a natural aptitude for conducting logical operations (for instance, the logical AND). Additionally, other emerging technologies, such as integrated photonics, have the potential to enhance the CIM paradigm further with their capacity for high-frequency and low-latency functions. Our ambition is to integrate multiple such technologies, harnessing their distinct attributes, to craft a CIM design that surpasses the SotA counterparts across key benchmarks, be it in execution speed or energy. This thesis demonstrates that when CIM is fused with emerging (memory) technologies, there is a marked enhancement in the performance of several Genomics pipelines and Machine Learning applications. It is our aspiration and conviction that the evaluations, methodologies, and findings detailed in this dissertation will empower the broader community to comprehend and address contemporary and upcoming challenges that revolve around enhancing the performance and energy efficiency of modern applications through the integration of (re)emerging computing paradigms and technologies. Additionally, our work provides insights for adapting these technologies to novel applications, ensuring they deliver optimal benefits.

High-Performance Data Mapping for BNNs on PCM-Based Integrated Photonics

Conference paper (2024) - Taha Shahroodi (author) , Raphael Cardoso (author) , J.S.S.M. Wong (author) , A. Bosio (author) , Ian O'Connor (author) , S Hamdioui (author)

State-of-the-Art (SotA) hardware implementations of Deep Neural Networks (DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are potential alternative solutions to realize faster implementations without losing accuracy. In this paper, we first present a new data ...

SparseMEM

Energy-efficient Design for In-memory Sparse-based Graph Processing

Conference paper (2023) - M.Z. Zahedi (author) , Geert Custers (author) , Taha Shahroodi (author) , G. Gaydadjiev (author) , Stephan Wong (author) , Said Hamdioui (author)

Performing analysis on large graph datasets in an energy-efficient manner has posed a significant challenge; not only due to excessive data movements and poor locality, but also due to the non-optimal use of high sparsity of such datasets. The latter leads to a waste of resources ...

Efficient Signed Arithmetic Multiplication on Memristor-based Crossbar

Journal article (2023) - Mahdi Zahedi (author) , Taha Shahroodi (author) , S. Wong (author) , Said Hamdioui (author)

The vast potential of memristor-based computation-in-memory (CIM) engines has mainly triggered the mapping of best-suited applications. Nevertheless, with additional support, existing applications can also benefit from CIM. In particular, this paper proposes an energy and area-ef ...

Lightspeed Binary Neural Networks using Optical Phase-Change Materials

Conference paper (2023) - Taha Shahroodi (author) , Rafaela Cardoso (author) , Mahdi Zahedi (author) , Stephan Wong (author) , A. Bosio (author) , I. O'Connor (author) , S. Hamdioui (author)

This paper investigates the potential of a compute-in-memory core based on optical Phase Change Materials (oPCMs) to speed up and reduce the energy consumption of the Matrix-Matrix-Multiplication operation. The paper also proposes a new data mapping for Binary Neural Networks (BN ...

SieveMem: A Computation-in-Memory Architecture for Fast and Accurate Pre-Alignment

Conference paper (2023) - Taha Shahroodi (author) , Michael Miao (author) , M.Z. Zahedi (author) , J.S.S.M. Wong (author) , Said Hamdioui (author)

The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to reduce the execution time of short-read sequence alignment greatly. With its suc ...

pLUTo

Enabling Massively Parallel Computation in DRAM via Lookup Tables

Conference paper (2022) - Joao Dinis Ferreira (author) , Gabriel Falcao (author) , Juan Gomez-Luna (author) , Mohammed Alser (author) , Lois Orosa (author) , Mohammad Sadrosadati (author) , Jeremie S. Kim (author) , Geraldo F. Oliveira (author) , Taha Shahroodi (author) , G.B. More Authors (author)

Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory ...

System Design for Computation-in-Memory

From Primitive to Complex Functions

Conference paper (2022) - M.Z. Zahedi (author) , Taha Shahroodi (author) , Geert Custers (author) , Abhairaj Singh (author) , J.S.S.M. Wong (author) , S. Hamdioui (author)

In recent years, we are witnessing a trend moving away from conventional computer architectures towards Computation-In-Memory (CIM) based on emerging memristor devices. This is due to the fact that the performance and energy efficiency of traditional computer architectures can no ...

Demeter

A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory

Journal article (2022) - Taha Shahroodi (author) , M.Z. Zahedi (author) , Can Firtina (author) , Mohammed Alser (author) , J.S.S.M. Wong (author) , Onur Mutlu (author) , S Hamdioui (author)

Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential frauds in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art ...

KrakenOnMem

A Memristor-Augmented HW/SW Framework for Taxonomic Profiling

Conference paper (2022) - Taha Shahroodi (author) , Mahdi Zahedi (author) , A. Singh (author) , S. Wong (author) , S Hamdioui (author)

State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck ...

CIM-based Robust Logic Accelerator using 28 nm STT-MRAM Characterization Chip Tape-out

Conference paper (2022) - A. Singh (author) , Mahdi Zahedi (author) , Taha Shahroodi (author) , Mohit Gupta (author) , Anteneh Gebregiorgis (author) , Manu Komalan (author) , R.V. Joshi (author) , F. Catthoor (author) , Rajendra bishnoi (author) , S Hamdioui (author)

Spin-transfer torque magnetic random access memory (STT-MRAM) based computation-in-memory (CIM) architectures have shown great prospects for an energy-efficient computing. However, device variations and non-idealities narrow down the sensing margin that severely impacts the compu ...

Pythia

A customizable hardware prefetching framework using online reinforcement learning

Conference paper (2021) - Rahul Bera (author) , Konstantinos Kanellopoulos (author) , Anant V. Nori (author) , Taha Shahroodi (author) , Sreenivas Subramoney (author) , Onur Mutlu (author)

Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniq ...

Past research has proposed numerous hardware prefetching techniques, most of which rely on exploiting one specific type of program context information (e.g., program counter, cacheline address, or delta between cacheline addresses) to predict future memory accesses. These techniques either completely neglect a prefetcher's undesirable effects (e.g., memory bandwidth usage) on the overall system, or incorporate system-level feedback as an afterthought to a system-unaware prefetch algorithm.We showthat prior prefetchers often lose their performance benefit over a wide range of workloads and system configurations due to their inherent inability to take multiple different types of program context and system-level feedback information into account while prefetching. In this paper, we make a case for designing a holistic prefetch algorithm that learns to prefetch using multiple different types of program context and system-level feedback information inherent to its design. To this end, we propose Pythia, which formulates the prefetcher as a reinforcement learning agent. For every demand request, Pythia observes multiple different types of program context information to make a prefetch decision. For every prefetch decision, Pythia receives a numerical reward that evaluates prefetch quality under the current memory bandwidth usage. Pythia uses this reward to reinforce the correlation between program context information and prefetch decision to generate highly accurate, timely, and systemaware prefetch requests in the future. Our extensive evaluations using simulation and hardware synthesis show that Pythia outperforms two state-of-the-art prefetchers (MLOP and Bingo) by 3.4% and 3.8% in single-core, 7.7% and 9.6% in twelve-core, and 16.9% and 20.2% in bandwidth-constrained core configurations, while incurring only 1.03% area overhead over a desktop-class processor and no software changes in workloads. The source code of Pythia can be freely downloaded from https://github.com/CMU-SAFARI/Pythia.