J.S.S.M. Wong
47 records found
Advances in quantum algorithms as well as in control hardware designs are continuously being made. These quantum algorithms, expressed as quantum circuits, need to be translated to a set of instructions from a defined quantum instruction-set architecture (ISA), which are executed by the control hardware. These translations can be done by a compiler, targeting different qubit technologies. Specifically for diamond NV centers, no compiler exists to perform this translation. Therefore, in this paper we present a compiler designed for quantum computers utilizing diamond NV center specific instructions, such as direct carbon control and partial swaps, to reduce execution times and gate count. Additionally, our compiler extends general compilers by supporting classical instructions to perform state tomography and measurement-based operations. The output of the compiler is tested in a diamond NV center specific simulator. Comparing a general compiler output with the diamond NV center specific output of our compiler while applying decoherence and depolarization noise showed reduced noise effects due to diamond specific decomposition. The compiler was also tested on state tomography and measurement-based operations, both of which proved to be functional. Our results show that we have successfully created a compiler with integrated support for classical and quantum instructions, which can improve circuit execution fidelity by utilizing diamond specific optimizations.
State-of-the-Art (SotA) hardware implementations of Deep Neural Networks (DNNs) incur high latencies and costs. Binary Neural Networks (BNNs) are potential alternative solutions to realize faster implementations without losing accuracy. In this paper, we first present a new data mapping, called TacitMap, suited for BNNs implemented based on a Computation-In-Memory (CIM) architecture. TacitMap maximizes the use of available parallelism, while CIM architecture eliminates the data movement overhead. We then propose a hardware accelerator based on optical phase change memory (oPCM) called EinsteinBarrier. EinsteinBarrier incorporates TacitMap and adds an extra dimension for parallelism through wavelength division multiplexing, leading to extra latency reduction. The simulation results show that, compared to the SotA CIM baseline, TacitMap and EinsteinBarrier significantly improve execution time by up to ∼154× and ∼3113×, respectively, while also maintaining the energy consumption within 60% of that in the CIM baseline.
BCIM
Efficient Implementation of Binary Neural Network Based on Computation in Memory
Applications of Binary Neural Networks (BNNs) are promising for embedded systems with hard constraints on energy and computing power. Contrary to conventional neural networks using floating-point datatypes, BNNs use binarized weights and activations to reduce memory and computation requirements. Memristors, emerging non-volatile memory devices, show great potential as a target implementation platform for BNNs by integrating storage and compute units. However, the efficiency of this hardware highly depends on how the network is mapped and executed on these devices. In this paper, we propose an efficient implementation of XNOR-based BNN to maximize parallelization. In this implementation, costly analog-to-digital converters are replaced with sense amplifiers with custom reference(s) to generate activation values. Besides, a novel mapping is introduced to minimize the overhead of data communication between convolution layers mapped to different memristor crossbars. This comes with extensive analytical and simulation-based analysis to evaluate the implication of different design choices considering the accuracy of the network. The results show that our approach achieves up to 5× energy-saving and 100× improvement in latency compared to baselines.
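As a rough illustration of the XNOR-based BNN idea in the abstract above, the sketch below computes a binary dot product as XNOR followed by a population count, and produces the activation by comparing that count against a reference value, loosely analogous to a sense amplifier with a custom reference replacing an analog-to-digital converter. The function names, the example vectors, and the choice of reference (half the vector length) are assumptions made for illustration; this is not the paper's mapping or circuit.

```python
def to_bits(values):
    """Encode values from {-1, +1} as bits {0, 1} (hypothetical helper)."""
    return [1 if v > 0 else 0 for v in values]


def xnor_popcount(a_bits, b_bits):
    """Count the positions where the two bit vectors agree (XNOR, then popcount)."""
    return sum(1 for a, b in zip(a_bits, b_bits) if a == b)


def binary_activation(weight_bits, input_bits, reference):
    """Return 1 if the match count reaches the reference, else 0.

    For n-element vectors the {-1, +1} dot product equals 2 * matches - n,
    so a reference of n / 2 corresponds to thresholding the dot product at 0.
    """
    return 1 if xnor_popcount(weight_bits, input_bits) >= reference else 0


# Illustrative data only (not taken from the paper).
w = [1, -1, 1, 1, -1, -1, 1, -1]
x = [1, 1, 1, -1, -1, 1, 1, -1]

matches = xnor_popcount(to_bits(w), to_bits(x))
dot = 2 * matches - len(w)  # same value as sum(wi * xi)
activation = binary_activation(to_bits(w), to_bits(x), reference=len(w) / 2)
print(matches, dot, activation)
```

Comparing the match count against a fixed reference gives the binarized activation directly, so the full-precision sum never has to be read out.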
SparseMEM
Energy-efficient Design for In-memory Sparse-based Graph Processing
The vast potential of memristor-based computation-in-memory (CIM) engines has mainly triggered the mapping of best-suited applications. Nevertheless, with additional support, existing applications can also benefit from CIM. In particular, this paper proposes an energy and area-efficient CIM-based methodology to perform arithmetic signed matrix multiplications. Our approach combines a) the mapping of the signed operands on the 1T1R crossbar, and b) the augmentation of the periphery with customized circuits to support the execution of shift and accumulate needed for the arithmetic operations. The operand mapping is performed without the need for sign extension; hence, reducing the required memory size. To demonstrate the superiority of our scheme as compared with the state-of-the-art, simulations are performed for different case studies including a neural network and two kernels which are taken from the Polybench/C benchmark suite. The results show that our approach achieves up to 8× energy-saving and 3× area-saving compared with other CIM-based prior works.
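The shift-and-accumulate step mentioned in the abstract above can be sketched in software as follows. Operands are split into two's-complement bit planes of a fixed width with no sign extension; each pair of bit planes yields a binary dot product (the kind of sum a crossbar column could produce), and the partial results are combined with powers of two, where the most significant plane carries a negative weight. The 4-bit width, the function names, and the example data are assumptions; the paper's actual operand mapping and periphery circuits are not reproduced here.

```python
def to_bit_planes(value, n_bits):
    """Two's-complement bits of value, LSB first, using exactly n_bits (no sign extension)."""
    assert -(1 << (n_bits - 1)) <= value < (1 << (n_bits - 1))
    return [(value >> i) & 1 for i in range(n_bits)]


def plane_weight(i, n_bits):
    """Weight of bit i: the MSB contributes -2^(n-1), every other bit +2^i."""
    return -(1 << i) if i == n_bits - 1 else (1 << i)


def signed_matvec(matrix, vector, n_bits=4):
    """Signed matrix-vector product built from binary partial sums."""
    vec_bits = [to_bit_planes(x, n_bits) for x in vector]
    result = []
    for row in matrix:
        row_bits = [to_bit_planes(w, n_bits) for w in row]
        acc = 0
        for i in range(n_bits):        # bit plane of the stored operand
            for j in range(n_bits):    # bit plane of the streamed input
                # Binary dot product of one stored plane with one input plane.
                partial = sum(rb[i] & vb[j] for rb, vb in zip(row_bits, vec_bits))
                # Shift (multiply by the plane weights) and accumulate.
                acc += plane_weight(i, n_bits) * plane_weight(j, n_bits) * partial
        result.append(acc)
    return result


# Illustrative data only (not taken from the paper).
print(signed_matvec([[3, -2], [-4, 1]], [-3, 5]))
```

Because the sign is handled through the negative weight of the top bit plane, each operand occupies only its nominal bit width in this sketch.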
The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to greatly reduce the execution time of short-read sequence alignment. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, the filtering itself now constitutes the larger portion of the execution time. A significant contributing factor is the movement of sequences from the memory to the processing units, even though the majority are filtered out because they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future pre-alignment filtering algorithms using the same operations and underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to support shared kernels of pre-alignment filters and algorithms inside the memory (i.e., preventing data movements). The SieveMem architecture also provides support for future algorithms. SieveMem supports more than 47.6% of shared operations among all top 5 SotA filters. Moreover, SieveMem includes a hardware-friendly pre-alignment filtering algorithm called BandedKrait, inspired by a combination of the mentioned kernels. Our evaluations show that SieveMem provides up to 331.1× and 446.8× improvement in the execution time of the two most-common kernels. Our evaluations also show that BandedKrait provides accuracy at the SotA level. Using BandedKrait on SieveMem, a design we call Mem-BandedKrait, one can improve the execution time of end-to-end sequence alignment irrespective of the dataset, by up to 91.4× compared to the SotA accelerator on GPU.
System Design for Computation-in-Memory
From Primitive to Complex Functions
MNEMOSENE
Tile Architecture and Simulator for Memristor-based Computation-in-memory
In recent years, we have witnessed a trend toward in-memory computing for future generations of computers, which differs from the traditional von-Neumann architecture in which there is a clear distinction between computing and memory units. Considering that data movements between the central processing unit (CPU) and memory consume several orders of magnitude more energy compared to simple arithmetic operations in the CPU, in-memory computing will lead to huge energy savings as data no longer needs to be moved around between these units. In an initial step toward this goal, new non-volatile memory technologies, e.g., resistive RAM (ReRAM) and phase-change memory (PCM), are being explored. This has led to a large body of research that mainly focuses on the design of the memory array and its peripheral circuitry. In this article, we mainly focus on the tile architecture (comprising a memory array and peripheral circuitry) in which storage and compute operations are performed in the (analog) memory array and the results are produced in the (digital) periphery. Such an architecture is termed compute-in-memory-periphery (CIM-P). More precisely, we derive an abstract CIM-tile architecture and define its main building blocks. To bridge the gap between higher-level programming languages and the underlying (analog) circuit designs, an instruction-set architecture is defined that is intended to control and, in turn, sequence the operations within this CIM tile to perform higher-level, more complex operations. Moreover, we define a procedure to pipeline the CIM-tile operations to further improve the performance. To simulate the tile and perform design space exploration considering different technologies and parameters, we introduce the fully parameterized first-of-its-kind CIM tile simulator and compiler. Furthermore, the compiler is technology-aware when scheduling the CIM-tile instructions.
Finally, using the simulator, we perform several preliminary design space explorations regarding the three competing technologies, ReRAM, PCM, and STT-MRAM, concerning CIM-tile parameters, e.g., the number of ADCs. Additionally, we investigate the effect of pipelining in relation to the clock speeds of the digital periphery assuming the three technologies. In the end, we demonstrate that our simulator is also capable of reporting energy consumption for each building block within the CIM tile after the execution of in-memory kernels, considering the data dependency of the energy consumption of the memory array. All source code is publicly available.
Demeter
A Fast and Energy-Efficient Food Profiler Using Hyperdimensional Computing in Memory
KrakenOnMem
A Memristor-Augmented HW/SW Framework for Taxonomic Profiling
State-of-the-art taxonomic profilers that comprise the first step in larger-context metagenomic studies have proven to be computationally intensive, i.e., while accurate, they come at the cost of high latency and energy consumption. Table Lookup operation is a primary bottleneck of today's profilers. In this paper, we first propose TL-PIM, a hardware accelerator based on the processing-in-memory (PIM) paradigm to accelerate Table Lookup. TL-PIM leverages the in-memory compute capability of emerging memory technologies along with intelligent data mapping. Then, we integrate TL-PIM into Kraken2, a state-of-the-art metagenomic profiler, and build an HW/SW co-designed profiler, called KrakenOnMem. Results from a silicon-based prototype of our emerging memory validate the design and required operations on a smaller scale. Our large-scale calibrated simulations show that KrakenOnMem can provide an average of 61.3% speedup compared to original Kraken2 for end-to-end profiling. Additionally, our design improves the energy consumption by orders of magnitude compared to the original Kraken2 while incurring a negligible area overhead.
Von Neumann-based architectures suffer from costly communication between CPU and memory. This communication imposes several orders of magnitude more power and performance overheads compared to the arithmetic operations performed by the processor. This overhead becomes critical for applications that require processing a large amount of data. Computation-in-Memory (CIM) leveraging memristor devices in the crossbar structure offers a potential solution to tackle this challenge. However, support for the integer data type is lacking in CIM approaches as most solutions operate on a single bit or a few bits only. This paper proposes a new organization of the periphery (next to the memristor crossbar) to compute matrix-matrix multiplication (MMM) at the tile level. More precisely, the analog additions performed in the crossbar are complemented with additions performed in the digital periphery. In this mixed analog-digital system, digital additions are performed in a way that only adders of the minimum size are required; this reduces the latency of the digital periphery as much as possible. In addition, the design is customized to the number of ADCs as well as datatype sizes to support different possible scenarios. The results show that our organization reduces energy and latency by up to 50× and 3×, respectively, compared to the reference design.
Adaptive processors can dynamically change their hardware configuration by tuning several knobs that optimize a given metric, according to the current application. However, the complexity of choosing the best setup at runtime increases exponentially as more adaptive resources become available. Therefore, we propose a polymorphic VLIW processor coupled to a machine learning-based decision mechanism that quickly and accurately delivers the best trade-off in terms of energy, performance, and reliability. The proposed system predicts the best processor configuration in 97.37% of the test cases and achieves an efficiency that is close to an oracle (more than 93.30% on all benchmarks).
CIM-SIM
Computation in Memory SIMulator
Computation-in-memory reverses the trend in von-Neumann processors by bringing the computation closer to the data, even into the memory array itself, as opposed to introducing new memory hierarchies to keep (frequently used) data closer to a central processing unit (CPU). In recent years, new non-volatile memory (NVM) technologies, e.g., memristor, PCM, etc., have proven that they can function as memories and perform computations on the stored data as well. In particular, when they are combined with a modest set of (digital) peripheral modules, a wider range of operations can be supported, e.g., vector matrix multiply and Boolean logic. In this paper, we introduce CIM-SIM, an open-source simulator written in SystemC, which is capable of simulating the functional behaviour of such architectures. The architecture includes the definition of a set of technology-agnostic nano-instructions.
DIM-VEX
Exploiting Design Time Configurability and Runtime Reconfigurability
...
Embedded processors must efficiently deliver performance at low energy consumption. Both configurable and reconfigurable techniques can be used to fulfill such constraints, although they are applied in different situations. In this work, we propose DIM-VEX, a configurable processor coupled with a reconfigurable fabric, which can leverage both design time configurability and runtime reconfigurability. We show that, on average, such a system can improve performance by up to 1.41× and reduce energy by up to 60% when compared to a configurable processor, at the cost of additional area.
Mixed-criticality systems need to provide strict guarantees to hard real-time tasks and, simultaneously, deliver high throughput for non-critical tasks. However, techniques to enhance performance more often than not affect the analyzability, e.g., caches, branch prediction, out-of-order (OoO) execution, superscalar processing, and simultaneous multithreading (SMT). In this paper, we propose the use of a polymorphic VLIW processor to increase performance for non-critical tasks while maintaining analyzability. The processor achieves these goals by dynamically distributing computing resources (in the form of datapaths) to one or multiple threads. A static schedule guarantees the minimum number of cycles to meet the deadlines for critical tasks. Datapaths that are not used by critical tasks can be assigned to non-critical tasks in a highly flexible way, thereby increasing resource utilization and resulting in higher throughput. Our experiments show that our approach can exploit its dynamic properties to improve schedulability and assign up to 50% and on average 25% more resources to lower-priority threads during the execution of a static real-time schedule.