J.S.S.M. Wong

47 records found

Conference paper (2025) - Folkert De Ronde, Stephan Wong, Sebastian Feld
Advances in quantum algorithms as well as in control hardware designs are continuously being made. These quantum algorithms, expressed as quantum circuits, need to be translated to a set of instructions from a defined quantum instruction-set architecture (ISA), which are executed by the control hardware. These translations can be done by a compiler, targeting different qubit technologies. Specifically for diamond NV centers, no compiler exists to perform this translation. Therefore, in this paper we present a compiler designed for quantum computers utilizing diamond NV center specific instructions, such as direct carbon control and partial swaps, to reduce execution times and gate count. Additionally, our compiler extends general compilers by allowing classical instructions to perform state tomography and measurement-based operations. The output of the compiler is tested in a diamond NV center specific simulator. Comparing a general compiler output with the diamond NV center specific output of our compiler while applying decoherence and depolarization noise showed reduced noise effects due to diamond specific decomposition. The compiler was also tested on state tomography and measurement-based operations, both of which proved functional. Our results show that we have successfully created a compiler with integrated support for classical and quantum instructions, which can improve circuit execution fidelity by utilizing diamond specific optimizations. ...
Conference paper (2024) - Taha Shahroodi, Raphael Cardoso, Stephan Wong, Alberto Bosio, Ian O'Connor, Said Hamdioui
State-of-the-Art (SotA) hardware implementations of Deep Neural Networks (DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are potential alternative solutions to realize faster implementations without losing accuracy. In this paper, we first present a new data mapping, called TacitMap, suited for BNNs implemented based on a Computation-In-Memory (CIM) architecture. TacitMap maximizes the use of available parallelism, while CIM architecture eliminates the data movement overhead. We then propose a hardware accelerator based on optical phase change memory (oPCM) called EinsteinBarrier. EinsteinBarrier incorporates TacitMap and adds an extra dimension for parallelism through wavelength division multiplexing, leading to extra latency reduction. The simulation results show that, compared to the SotA CIM baseline, TacitMap and EinsteinBarrier significantly improve execution time by up to ~154× and ~3113×, respectively, while also maintaining the energy consumption within 60% of that in the CIM baseline. ...

Efficient Implementation of Binary Neural Network Based on Computation in Memory

Journal article (2024) - Mahdi Zahedi, Taha Shahroodi, Carlos Escuin, Georgi Gaydadjiev, Stephan Wong, Said Hamdioui
Applications of Binary Neural Networks (BNNs) are promising for embedded systems with hard constraints on energy and computing power. Contrary to conventional neural networks using floating-point datatypes, BNNs use binarized weights and activations to reduce memory and computation requirements. Memristors, emerging non-volatile memory devices, show great potential as a target implementation platform for BNNs by integrating storage and compute units. However, the efficiency of this hardware highly depends on how the network is mapped and executed on these devices. In this paper, we propose an efficient implementation of XNOR-based BNN to maximize parallelization. In this implementation, costly analog-to-digital converters are replaced with sense amplifiers with custom reference(s) to generate activation values. Besides, a novel mapping is introduced to minimize the overhead of data communication between convolution layers mapped to different memristor crossbars. This comes with extensive analytical and simulation-based analysis to evaluate the implication of different design choices considering the accuracy of the network. The results show that our approach achieves up to 5× energy-saving and 100× improvement in latency compared to baselines. ...
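
The XNOR-based arithmetic mentioned in this abstract can be illustrated in software. The following is a minimal sketch, not the paper's memristor mapping: it assumes weights and activations in {-1, +1} are packed into integer bit masks (bit = 1 means +1), and the function names `encode`, `xnor_dot` and `binary_activation` are hypothetical. The threshold comparison stands in, loosely, for a sense amplifier compared against a reference.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")        # popcount
    return 2 * matches - n                 # matches minus mismatches


def binary_activation(a_bits, w_bits, n, threshold=0):
    """Binarize the pre-activation by comparing it against a reference threshold."""
    return 1 if xnor_dot(a_bits, w_bits, n) >= threshold else -1
```

For a = [1, -1, 1, 1] and w = [1, 1, -1, 1], two positions agree and two disagree, so `xnor_dot` returns 0, the same value as the ordinary dot product.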

Energy-efficient Design for In-memory Sparse-based Graph Processing

Performing analysis on large graph datasets in an energy-efficient manner poses a significant challenge, not only due to excessive data movements and poor locality, but also due to the non-optimal use of the high sparsity of such datasets. The latter leads to a waste of resources as the computation is also performed on zero operands, which do not contribute to the final result. This paper designs a novel graph processing accelerator, SparseMEM, targeting sparse datasets by leveraging the computing-in-memory (CIM) concept; CIM is a promising solution to alleviate the overhead of data movement and the inherent poor locality of graph processing. The proposed solution stores the graph information in a compressed hierarchical format inside the memory and adjusts the workflow based on this new mapping. This vastly improves resource utilization, leading to higher energy and performance efficiency. The experimental results demonstrate that SparseMEM outperforms a GPU-based platform and two state-of-the-art in-memory accelerators on speedup and energy efficiency by one and three orders of magnitude, respectively. ...
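
To make the point about skipping zero operands concrete, here is a small sketch using the standard compressed sparse row (CSR) layout. CSR is a stand-in chosen for illustration; the abstract does not specify SparseMEM's compressed hierarchical format, and the helper names `to_csr` and `csr_spmv` are hypothetical.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:                 # only non-zero entries are stored
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))    # where the next row starts
    return values, col_idx, row_ptr


def csr_spmv(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by vector x, touching only the stored non-zeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

For a weighted adjacency matrix with three non-zeros out of nine entries, the inner loop runs three times instead of nine; the saving grows with the sparsity of the graph.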
Conference paper (2023) - Taha Shahroodi, Rafaela Cardoso, Mahdi Zahedi, Stephan Wong, Alberto Bosio, Ian O'Connor, Said Hamdioui
This paper investigates the potential of a compute-in-memory core based on optical Phase Change Materials (oPCMs) to speed up and reduce the energy consumption of the Matrix-Matrix-Multiplication operation. The paper also proposes a new data mapping for Binary Neural Networks (BNNs) tailored for our oPCM core. The preliminary results show a significant latency improvement irrespective of the evaluated network structure and size. The improvement varies from network to network and goes up to ~1053×. ...
The vast potential of memristor-based computation-in-memory (CIM) engines has mainly triggered the mapping of best-suited applications. Nevertheless, with additional support, existing applications can also benefit from CIM. In particular, this paper proposes an energy and area-efficient CIM-based methodology to perform arithmetic signed matrix multiplications. Our approach combines a) the mapping of the signed operands on the 1T1R crossbar, and b) the augmentation of the periphery with customized circuits to support the execution of shift and accumulate needed for the arithmetic operations. The operand mapping is performed without the need for sign extension; hence, reducing the required memory size. To demonstrate the superiority of our scheme as compared with the state-of-the-art, simulations are performed for different case studies including a neural network and two kernels which are taken from the Polybench/C benchmark suite. The results show that our approach achieves up to 8× energy-saving and 3× area-saving compared with other CIM-based prior works. ...
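
The idea of signed multiplication through shift and accumulate, without sign extension, can be sketched in software. This is a generic bit-sliced illustration, assuming two's-complement weights whose most significant bit carries a negative weight; it does not model the paper's 1T1R crossbar or periphery circuits, and the name `signed_dot` and the 4-bit default are assumptions.

```python
def signed_dot(inputs, weights, bits=4):
    """Dot product with signed `bits`-wide weights, processed one bit plane at a time.

    Each weight must fit in two's complement: -2**(bits-1) <= w < 2**(bits-1).
    """
    acc = 0
    for i in range(bits):
        # One "array read" per bit plane: sum the inputs whose weight has bit i set.
        partial = sum(x for x, w in zip(inputs, weights) if (w >> i) & 1)
        # The sign bit has weight -2^(bits-1); all other bits have weight +2^i.
        place = -(1 << i) if i == bits - 1 else (1 << i)
        acc += partial * place         # shift and accumulate
    return acc
```

With inputs [1, 2, 3] and weights [-3, 5, -8], the four bit planes contribute 3, 0, 12 and -32, giving -17, which matches the direct computation. Only `bits` bit planes are stored per weight, so no extra sign-extension columns are needed.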
The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to greatly reduce the execution time of short-read sequence alignment. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, the filtering itself now constitutes the larger portion of the execution time. A significant contributing factor is the movement of sequences from the memory to the processing units, while a majority will be filtered out as they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future pre-alignment filtering algorithms using the same operations and underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to support shared kernels of pre-alignment filters and algorithms inside the memory (i.e., preventing data movements). SieveMem architecture also provides support for future algorithms. SieveMem supports more than 47.6% of shared operations among all top 5 SotA filters. Moreover, SieveMem includes a hardware-friendly pre-alignment filtering algorithm called BandedKrait, inspired by a combination of mentioned kernels. Our evaluations show that SieveMem provides up to 331.1× and 446.8× improvement in the execution time of the two most-common kernels. Our evaluations also show that BandedKrait provides accuracy at the SotA level. Using BandedKrait on SieveMem, a design we call Mem-BandedKrait, one can improve the execution time of end-to-end sequence alignment irrespective of the dataset, which can go up to 91.4× compared to the SotA accelerator on GPU. ...

From Primitive to Complex Functions

In recent years, we have witnessed a trend moving away from conventional computer architectures towards Computation-In-Memory (CIM) based on emerging memristor devices. This is due to the fact that the performance and energy efficiency of traditional computer architectures can no longer be increased at the same pace as before. The main barriers which limit the performance and energy improvement are the memory and power walls. Thus far, the main effort from researchers has been toward enabling CIM as an accelerator for specific applications. Consequently, this current application-specific nature/approach has put less emphasis on the potential general-purpose applicability of CIM, i.e., merging several accelerators into one that is less than the sum of the parts. In this paper, we demonstrate the CIM concept using a broader and generalized model. Considering this model, the state-of-the-art CIM-based logic and arithmetic primitive functions, which can be the building blocks for complex functions, are investigated. Besides, we present potential applications of CIM which provide insights into the challenges and opportunities of a generic CIM system design. Finally, we highlight the future directions regarding the construction of CIM-based systems. ...

Tile Architecture and Simulator for Memristor-based Computation-in-memory

Journal article (2022) - Mahdi Zahedi, Muath Abu Lebdeh, Christopher Bengel, Dirk Wouters, Stephan Menzel, Manuel Le Gallo, Abu Sebastian, Stephan Wong, Said Hamdioui
In recent years, we are witnessing a trend toward in-memory computing for future generations of computers that differs from traditional von-Neumann architecture in which there is a clear distinction between computing and memory units. Considering that data movements between the central processing unit (CPU) and memory consume several orders of magnitude more energy compared to simple arithmetic operations in the CPU, in-memory computing will lead to huge energy savings as data no longer needs to be moved around between these units. In an initial step toward this goal, new non-volatile memory technologies, e.g., resistive RAM (ReRAM) and phase-change memory (PCM), are being explored. This has led to a large body of research that mainly focuses on the design of the memory array and its peripheral circuitry. In this article, we mainly focus on the tile architecture (comprising a memory array and peripheral circuitry) in which storage and compute operations are performed in the (analog) memory array and the results are produced in the (digital) periphery. Such an architecture is termed compute-in-memory-periphery (CIM-P). More precisely, we derive an abstract CIM-tile architecture and define its main building blocks. To bridge the gap between higher-level programming languages and the underlying (analog) circuit designs, an instruction-set architecture is defined that is intended to control and, in turn, sequence the operations within this CIM tile to perform higher-level more complex operations. Moreover, we define a procedure to pipeline the CIM-tile operations to further improve the performance. To simulate the tile and perform design space exploration considering different technologies and parameters, we introduce the fully parameterized first-of-its-kind CIM tile simulator and compiler. Furthermore, the compiler is technology-aware when scheduling the CIM-tile instructions. 
Finally, using the simulator, we perform several preliminary design space explorations regarding the three competing technologies, ReRAM, PCM, and STT-MRAM concerning CIM-tile parameters, e.g., the number of ADCs. Additionally, we investigate the effect of pipelining in relation to the clock speeds of the digital periphery assuming the three technologies. In the end, we demonstrate that our simulator is also capable of reporting energy consumption for each building block within the CIM tile after the execution of in-memory kernels considering the data-dependency on the energy consumption of the memory array. All the source codes are publicly available. ...

A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory

Journal article (2022) - Taha Shahroodi, Mahdi Zahedi, Can Firtina, Mohammed Alser, Stephan Wong, Onur Mutlu, Said Hamdioui
Food profiling is an essential step in any food monitoring system needed to prevent health risks and potential frauds in the food industry. Significant improvements in sequencing technologies are pushing food profiling to become the main computational bottleneck. State-of-the-art profilers are unfortunately too costly for food profiling. Our goal is to design a food profiler that solves the main limitations of existing profilers, namely (1) working on massive data structures and (2) incurring considerable data movement, for a real-time monitoring system. To this end, we propose Demeter, the first platform-independent framework for food profiling. Demeter overcomes the first limitation through the use of hyperdimensional computing (HDC) and efficiently performs the accurate few-species classification required in food profiling. We overcome the second limitation by the use of an in-memory hardware accelerator for Demeter (named Acc-Demeter) based on memristor devices. Acc-Demeter actualizes several domain-specific optimizations and exploits the inherent characteristics of memristors to improve the overall performance and energy consumption of Acc-Demeter. We compare Demeter’s accuracy with other industrial food profilers using detailed software modeling. We synthesize Acc-Demeter’s required hardware using UMC’s 65nm library by considering an accurate PCM model based on silicon-based prototypes. Our evaluations demonstrate that Acc-Demeter achieves a (1) throughput improvement of 192× and 724× and (2) memory reduction of 36× and 33× compared to Kraken2 and MetaCache (2 state-of-the-art profilers), respectively, on typical food-related databases. Demeter maintains an acceptable profiling accuracy (within 2% of existing tools) and incurs a very low area overhead. ...
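
As a rough illustration of how hyperdimensional computing can classify sequences, the sketch below encodes a DNA string by bundling its k-mers into one bipolar hypervector and picks the most similar class prototype. This is a generic HDC n-gram scheme, not Demeter's actual encoding; the dimensionality, k-mer length, and the names `random_hv`, `encode`, `similarity` and `classify` are all assumptions made for the example.

```python
import random

DIM = 2048  # hypervector dimensionality (illustrative choice)


def random_hv(rng):
    """Draw a random bipolar (+1/-1) hypervector."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def encode(seq, item_memory, k=3):
    """Bundle all k-mers of seq into a single bipolar hypervector."""
    total = [0] * DIM
    for i in range(len(seq) - k + 1):
        gram = [1] * DIM
        for pos, base in enumerate(seq[i:i + k]):
            hv = item_memory[base]
            # Rotate by position so that order within the k-mer matters.
            rotated = hv[-pos:] + hv[:-pos] if pos else hv
            # Bind by element-wise multiplication.
            gram = [g * r for g, r in zip(gram, rotated)]
        total = [t + g for t, g in zip(total, gram)]
    # Binarize the bundled sum (ties resolved toward +1).
    return [1 if t >= 0 else -1 for t in total]


def similarity(a, b):
    """Normalized dot product in [-1, 1]; 1.0 means identical hypervectors."""
    return sum(x * y for x, y in zip(a, b)) / DIM


def classify(query, prototypes, item_memory):
    """Return the name of the prototype most similar to the encoded query."""
    q = encode(query, item_memory)
    return max(prototypes, key=lambda name: similarity(q, prototypes[name]))
```

In use, one hypervector per base ("A", "C", "G", "T") is drawn once into `item_memory`, each reference genome is encoded into a prototype, and a read is assigned to the prototype with the highest similarity. The whole database reduces to one fixed-size vector per class, which is the kind of compact structure that suits in-memory comparison.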

A Memristor-Augmented HW/SW Framework for Taxonomic Profiling

State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead. ...
Computation-in-memory (CIM) shows great promise for specific applications by employing emerging (non-volatile) memory technologies such as memristors for both storage and compute, greatly reducing energy consumption, and improving performance. Based on our own observations, we can clearly perceive the contours of a generic approach encompassing the use of a memristor array – using technologies such as PCM and ReRAM. In this paper, we present a new instruction-set architecture (ISA) to control a single CIM-tile that comprises the analog memory array itself and all necessary analog and digital periphery. The newly introduced ISA provides the following advantages: (1) flexibility in programming new CIM functionalities by simply rescheduling the instructions from the ISA, (2) definition of a simulation framework, (3) a hardware implementation of the digital periphery, and (4) a design-space exploration of specific CIM-tile operations targeting the aforementioned technologies. For (1), we defined our own compiler that can translate CIM-tile operations to a sequence of instructions from our ISA. The implementation of the digital periphery is synthesized with the 15 nm Nangate library and results regarding power/energy and area are presented. Finally, the design-space exploration is made possible by using the technology-specific parameters with values that have been verified by accurate technology models. All codes of the compiler and simulator as well as the HDL code of the digital periphery are publicly available. ...
Conference paper (2021) - H. den Boer, R.W.D. Muller, J.S.S.M. Wong, V. Voogt
A key obstacle within the design of cognitive radios has always been the spectrum sensing component that implements the automatic modulation classification (AMC) function. With the transition to software-defined radios (SDRs) followed by the introduction of field-programmable gate arrays (FPGAs) and deep learning (DL), it becomes possible to surmount this obstacle. However, the design of DL models is still detached from synthesized FPGA designs in current implementation frameworks. Consequently, the design process is a tedious and lengthy one. In this paper, a novel implementation framework is presented for implementing deep learning inference models within signal processing chains on FPGAs. The framework focuses on optimization for radio-frequency (RF) transceiver applications, aiming for high throughput, low latency, and a small FPGA resource footprint enabling the scaling to larger DL models. Demonstration of the implementation framework for AMC results in an operational throughput of 585k classifications per second. ...
Von Neumann-based architectures suffer from costly communication between CPU and memory. This communication imposes several orders of magnitude more power and performance overheads compared to the arithmetic operations performed by the processor. This overhead becomes critical for applications that require processing a large amount of data. Computation-in-Memory (CIM) leveraging memristor devices in the crossbar structure offers a potential solution to tackle this challenge. However, support for the integer data type is lacking in CIM approaches as most solutions operate on a single bit or a few bits only. This paper proposes a new organization of the periphery (next to the memristor crossbar) to compute matrix-matrix multiplication (MMM) at the tile level. More precisely, the analog additions performed in the crossbar are complemented with additions performed in the digital periphery. In this mixed analog-digital system, digital additions are performed in a way that only adders of the minimum size are required; this is to reduce the latency of the digital periphery as much as possible. In addition, the design is customized to the number of ADCs as well as datatype sizes to support different possible scenarios. The results show that our organization reduces energy and latency up to 50× and 3×, respectively, compared to the reference design. ...
Conference paper (2019) - Anderson Luiz Sartor, Pedro Henrique Exenberger Becker, Stephan Wong, Radu Marculescu, Antonio Carlos Schneider Beck
Adaptive processors can dynamically change their hardware configuration by tuning several knobs that optimize a given metric, according to the current application. However, the complexity of choosing the best setup at runtime increases exponentially as more adaptive resources become available. Therefore, we propose a polymorphic VLIW processor coupled to a machine learning-based decision mechanism that quickly and accurately delivers the best trade-off in terms of energy, performance, and reliability. The proposed system predicts the best processor configuration in 97.37% of the test cases and achieves an efficiency that is close to an oracle (more than 93.30% on all benchmarks). ...

Computation in Memory SIMulator

Conference paper (2019) - Ali Banagozar, Kanishkan Vadivel, Sander Stuijk, Henk Corporaal, Stephan Wong, Muath Abu Lebdeh, Jintao Yu, Said Hamdioui
Computation-in-memory reverses the trend in von-Neumann processors by bringing the computation closer to the data, to even within the memory array, as opposed to introducing new memory hierarchies to keep (frequently used) data closer to a central processing unit (CPU). In recent years, new non-volatile memory (NVM) technologies, e.g., memristor, PCM, etc., have proven that they can function as memories and perform computations on the stored data as well. In particular, when they are combined with a modest set of (digital) peripheral modules, a wider range of operations can be supported, e.g., vector matrix multiply and Boolean logic. In this paper, we introduce CIM-SIM, an open-source simulator written in SystemC, which is capable of simulating the functional behaviour of such architectures. The architecture includes the definition of a set of technology-agnostic nano-instructions. ...
Conference paper (2019) - Muath Abu Lebdeh, Uljana Reinsalu, Hoang Anh Du Nguyen, Stephan Wong, Said Hamdioui
Emerging computing applications (such as big-data and Internet-of-things) are extremely demanding in terms of storage, energy and computational efficiency, while today’s architectures and device technologies are facing major challenges making them incapable of meeting these demands. Computation-in-Memory (CIM) architecture based on memristive devices is one of the alternative computing architectures being explored to address these limitations. Enabling such architectures relies on the development of efficient memristive circuits being able to perform logic and arithmetic operations within the non-volatile memory core. This paper addresses memristive circuit designs for CIM architectures. It gives a complete overview of all designs, for both logic and arithmetic operations, and presents the most popular designs in detail. In addition, it analyzes and classifies them, shows how they result in different CIM flavours and how these architectures distinguish themselves from traditional ones. The paper also presents different potential applications that could significantly benefit from CIM architectures, based on their kernel that could be accelerated. ...
Journal article (2018) - Anderson L. Sartor, Pedro H. E. Becker, Joost Hoozemans, Stephan Wong, Antonio C.S. Beck
In the design of modern-day processors, energy consumption and fault tolerance have gained significant importance next to performance. This is caused by battery constraints, thermal design limits, and higher susceptibility to errors as transistor feature sizes are decreasing. However, achieving the ideal balance among them is challenging due to their conflicting nature (e.g., fault-tolerance techniques usually influence execution time or increase energy consumption), and that is why current processor designs target at most two of these axes. Based on that, we propose a new VLIW-based processor design capable of adapting the execution of the application at run-time in a totally transparent fashion, considering performance, fault tolerance, and energy consumption altogether, in which the weight (priority) of each one can be defined a priori. This is achieved by a novel decision module that dynamically controls the application's ILP to increase the possibility of replicating instructions or applying power gating. For an energy-oriented configuration, it is possible, on average, to reduce energy consumption by 37.2% with an overhead of only 8.2% in performance, while maintaining low levels of failure rate, when compared to a fault-tolerant design. ...

Exploiting Design Time Configurability and Runtime Reconfigurability

Conference paper (2018) - Jeckson D. Souza, Anderson L. Sartor, Luigi Carro, Mateus Beck Rutzig, Stephan Wong, Antonio C.S. Beck
Embedded processors must efficiently deliver performance at low energy consumption. Both configurable and reconfigurable techniques can be used to fulfill such constraints, although applied in different situations. In this work, we propose DIM-VEX, a configurable processor coupled with a reconfigurable fabric, which can leverage both design time configurability and runtime reconfigurability. We show that, on average, such a system can improve performance by up to 1.41× and reduce energy by up to 60% when compared to a configurable processor at the cost of additional area. ...
Journal article (2018) - Joost Hoozemans, Jeroen van Straten, Stephan Wong
Mixed-criticality systems need to provide strict guarantees to hard real-time tasks and, simultaneously, deliver high throughput for non-critical tasks. However, techniques to enhance performance more often than not affect the analyzability, e.g., caches, branch prediction, out-of-order (OoO) execution, superscalar processing, and simultaneous multithreading (SMT). In this paper, we propose the use of a polymorphic VLIW processor to increase performance for non-critical tasks while maintaining analyzability. The processor achieves these goals by dynamically distributing computing resources (in the form of datapaths) to one or multiple threads. A static schedule guarantees the minimum number of cycles to meet the deadlines for critical tasks. Datapaths that are not used by critical tasks can be assigned to non-critical tasks in a highly flexible way, thereby increasing resource utilization and resulting in higher throughput. Our experiments show that our approach can exploit its dynamic properties to improve schedulability and assign up to 50% and on average 25% more resources to lower-priority threads during the execution of a static real-time schedule. ...