R. Nane
Please Note
11 records found
1
Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because of the utmost importance to simulate increasingly larger computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays could accelerate scientific computing because of the possibility to fully customize the memory hierarchy important in irregular applications such as iterative linear solvers. In this article, we study the potential of using Field-Programmable Gate Arrays in High-Performance Computing because of the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, increasing number of logic cells, and the integration of High-Bandwidth Memories on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the developed preconditioned iterative solver in Flow from the Open Porous Media project, a state-of-the-art open source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel in both stand-alone mode and integrated in the reservoir simulator, using the NORNE field, a real-world case reservoir model using a grid with more than 105 cells and using three unknowns per cell.
OpenQL
A Portable Quantum Programming Framework for Quantum Accelerators
With the potential of quantum algorithms to solve intractable classical problems, quantum computing is rapidly evolving, and more algorithms are being developed and optimized. Expressing these quantum algorithms using a high-level language and making them executable on a quantum processor while abstracting away hardware details is a challenging task. First, a quantum programming language should provide an intuitive programming interface to describe those algorithms. Then a compiler has to transform the program into a quantum circuit, optimize it, and map it to the target quantum processor respecting the hardware constraints such as the supported quantum operations, the qubit connectivity, and the control electronics limitations. In this article, we propose a quantum programming framework named OpenQL, which includes a high-level quantum programming language and its associated quantum compiler. We present the programming interface of OpenQL, we describe the different layers of the compiler and how we can provide portability over different qubit technologies. Our experiments show that OpenQL allows the execution of the same high-level algorithm on two different qubit technologies, namely superconducting qubits and Si-Spin qubits. Besides the executable code, OpenQL also produces an intermediate quantum assembly code, which is technology independent and can be simulated using the QX simulator.
Memristor-based Computation-in-Memory (CIM) is one of the emerging architectures for next-generation Big Data problems. Its design requires a radically new synthesis flow as the memristor is a passive device that uses resistances to encode its logic values. This article proposes a synthesis flow for mapping parallel applications on memristor-based CIM architecture. First, it employs solution templates that contain scheduling, placement, and routing information to map multiple algorithms with similar data flow graphs to the memristor crossbar; this template is named skeleton. Complex algorithms that do not fit a single skeleton can be solved by nested skeletons. Therefore, this approach can be applied to a wide range of applications while using a limited number of skeletons only. Second, it further improves the design when spatial and temporal patterns exist in input data. To accelerate simulation of generated SystemC models, we integrate MPI in skeletons. The synthesis flow and its additional features are verified with multiple applications, and the results are compared against a multicore platform. These experiments demonstrate the feasibility and the potential of this approach.
Sparstition
A partitioning scheme for large-scale sparse matrix vector multiplication on FPGA
Sparse Matrix Vector Multiplication (SpMV) is a key kernel in various domains, that is known to be difficult to parallelize efficiently due to the low spatial locality of data. This is problematic for computing large-scale SpMV due to limited cache sizes but also in achieving speedups through parallel execution. To address these issues, we present 1) sparstition, a novel partitioning scheme that enables computing SpMV without the need to do any major post-processing steps, and 2) a corresponding HLS-based hardware design that is able to perform large-scale SpMV efficiently. The design is pipelined so the matrix size is limited only by the size of the off-chip memory (DRAM) and not by the available on-chip memory (BRAMs). Our experimental results, performed on a ZedBoard, show that we achieve a computational throughput of up to 300 MFLOPS in single-precision and 108 MFLOPS in double-precision, an improvement of 2.6X on average compared to current state-of-the-art HLS results. Finally, we predict that sparstition can boost the computational throughput of HLS-based SpMV kernel to over 10 GFLOPS when using High Bandwidth Memories.
access may exceed 80%. Recently, the concept of Computation-in-Memory (CIM) was proposed, which is based on the integration of storage and computation in the same physical location using a crossbar topology and non-volatile resistive-switching memristor technology. To illustrate the tremendous potential of CIM architecture in exploiting massively parallel computation while reducing the communication overhead, we present a communicationefficient mapping of a large-scale matrix multiplication algorithm on the CIM architecture. The experimental results show that, depending on the matrix size, CIM architecture exhibits several orders of magnitude higher performance in total execution time
and two orders of magnitude better in total energy consumption than the multicore-based on the shared memory architecture. ...
access may exceed 80%. Recently, the concept of Computation-in-Memory (CIM) was proposed, which is based on the integration of storage and computation in the same physical location using a crossbar topology and non-volatile resistive-switching memristor technology. To illustrate the tremendous potential of CIM architecture in exploiting massively parallel computation while reducing the communication overhead, we present a communicationefficient mapping of a large-scale matrix multiplication algorithm on the CIM architecture. The experimental results show that, depending on the matrix size, CIM architecture exhibits several orders of magnitude higher performance in total execution time
and two orders of magnitude better in total energy consumption than the multicore-based on the shared memory architecture.
detection algorithms, available solutions need expensive hardware and/or additional ASICs, which are not suitable for low-cost commodity hardware devices. In this paper, we propose a low-cost and low-power embedded solution to provide high speed depth detection. We extend an existing off-the-shelf VLIW
image processor and perform algorithmic and architectural optimizations in order to achieve the requested real-time performance speed. Experimental results show that by adding different functional units and adjusting the algorithm to take full advantage of them, a 640x480 image pair with 64 disparities1 can be processed at 36.75 fps on a single processor instance, which
is an improvement of 23% compared to the best state-of-the-art image processor. ...
detection algorithms, available solutions need expensive hardware and/or additional ASICs, which are not suitable for low-cost commodity hardware devices. In this paper, we propose a low-cost and low-power embedded solution to provide high speed depth detection. We extend an existing off-the-shelf VLIW
image processor and perform algorithmic and architectural optimizations in order to achieve the requested real-time performance speed. Experimental results show that by adding different functional units and adjusting the algorithm to take full advantage of them, a 640x480 image pair with 64 disparities1 can be processed at 36.75 fps on a single processor instance, which
is an improvement of 23% compared to the best state-of-the-art image processor.