E.J. Houtgast
Please Note
11 records found
1
On Hardware-Accelerated Maximally-Efficient Systolic Arrays
Acceleration and Optimization of Genomics Pipelines Through Hardware/Software Co-Design
The BWA-MEM genomic mapping algorithm is a critical first step of many genomics pipelines, as it maps the raw input sequences onto a reference genome, thereby reconstructing the sample's original genetic assembly. A major part of overall BWA-MEM execution time is spent performing Seed Extension, an algorithm closely related to the Smith-Waterman pairwise sequence alignment algorithm. The standard approach for the heterogeneous acceleration of the Smith-Waterman algorithm is to map it onto a systolic array architecture to compute elements of the similarity matrix in parallel. In order for systolic arrays to operate at high efficiency, they require long sequences to be aligned to one another. The BWA-MEM algorithm, in contrast, typically generates very short sequences that then require pairwise alignment through the Seed Extension algorithm. Therefore, in this dissertation, various techniques to improve the efficiency of systolic arrays for short sequence lengths are proposed.
The Variable Logical Length, the Variable Physical Length, and the Variable Logical and Physical Length systolic array architectures are proposed to eliminate the dependence of systolic array efficiency on read sequence length. To eliminate its dependence on reference sequence length, a streaming, implicit synchronizing architecture is introduced. Together, these techniques result in a maximally-efficient systolic array. A Seed Extension kernel has been implemented on both FPGA and GPU with a threefold kernel-level improvement to execution time, resulting in the first FPGA-accelerated and the first GPU-accelerated implementation of BWA-MEM with an overall end-to-end twofold application-level speedup. Moreover, a Smith-Waterman implementation has been developed on FPGA using the above efficiency improvements to the systolic array architecture, resulting in an implementation that has a performance of 214 GCUPS and that is able to attain 99.8% efficiency, which is the highest reported efficiency and performance of any FPGA-accelerated Smith-Waterman implementation to date. Finally, various aspects of these designs are evaluated, including power-efficiency and design-time. ...
The BWA-MEM genomic mapping algorithm is a critical first step of many genomics pipelines, as it maps the raw input sequences onto a reference genome, thereby reconstructing the sample's original genetic assembly. A major part of overall BWA-MEM execution time is spent performing Seed Extension, an algorithm closely related to the Smith-Waterman pairwise sequence alignment algorithm. The standard approach for the heterogeneous acceleration of the Smith-Waterman algorithm is to map it onto a systolic array architecture to compute elements of the similarity matrix in parallel. In order for systolic arrays to operate at high efficiency, they require long sequences to be aligned to one another. The BWA-MEM algorithm, in contrast, typically generates very short sequences that then require pairwise alignment through the Seed Extension algorithm. Therefore, in this dissertation, various techniques to improve the efficiency of systolic arrays for short sequence lengths are proposed.
The Variable Logical Length, the Variable Physical Length, and the Variable Logical and Physical Length systolic array architectures are proposed to eliminate the dependence of systolic array efficiency on read sequence length. To eliminate its dependence on reference sequence length, a streaming, implicit synchronizing architecture is introduced. Together, these techniques result in a maximally-efficient systolic array. A Seed Extension kernel has been implemented on both FPGA and GPU with a threefold kernel-level improvement to execution time, resulting in the first FPGA-accelerated and the first GPU-accelerated implementation of BWA-MEM with an overall end-to-end twofold application-level speedup. Moreover, a Smith-Waterman implementation has been developed on FPGA using the above efficiency improvements to the systolic array architecture, resulting in an implementation that has a performance of 214 GCUPS and that is able to attain 99.8% efficiency, which is the highest reported efficiency and performance of any FPGA-accelerated Smith-Waterman implementation to date. Finally, various aspects of these designs are evaluated, including power-efficiency and design-time.
We present our work on hardware accelerated genomics pipelines, using either FPGAs or GPUs to accelerate execution of BWA-MEM, a widely-used algorithm for genomic short read mapping. The mapping stage can take up to 40% of overall processing time for genomics pipelines. Our implementation offloads the Seed Extension function, one of the main BWA-MEM computational functions, onto an accelerator. Sequencers typically output reads with a length of 150 base pairs. However, read length is expected to increase in the near future. Here, we investigate the influence of read length on BWA-MEM performance using data sets with read length up to 400 base pairs, and introduce methods to ameliorate the impact of longer read length. For the industry-standard 150 base pair read length, our implementation achieves an up to two-fold increase in overall application-level performance for systems with at most twenty-two logical CPU cores. Longer read length requires commensurately bigger data structures, which directly impacts accelerator efficiency. The two-fold performance increase is sustained for read length of at most 250 base pairs. To improve performance, we perform a classification of the inefficiency of the underlying systolic array architecture. By eliminating idle regions as much as possible, efficiency is improved by up to +95%. Moreover, adaptive load balancing intelligently distributes work between host and accelerator to ensure use of an accelerator always results in performance improvement, which in GPU-constrained scenarios provides up to +45% more performance.
Comparative Analysis of System-Level Acceleration Techniques in Bioinformatics
A Case Study of Accelerating the Smith-Waterman Algorithm for BWA-MEM
Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the streaming, implicit synchronizing nature of our implementation, which streams alignments and places no restrictions on
the number of alignments in flight, it achieves 99.8% of this performance in practice, almost three times as fast as previous implementations. The expressiveness of OpenCL results in a significant reduction in lines of code, and in a significant reduction of development time compared to programming in
regular hardware description languages. ...
Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the streaming, implicit synchronizing nature of our implementation, which streams alignments and places no restrictions on
the number of alignments in flight, it achieves 99.8% of this performance in practice, almost three times as fast as previous implementations. The expressiveness of OpenCL results in a significant reduction in lines of code, and in a significant reduction of development time compared to programming in
regular hardware description languages.
generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version. ...
generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version.
against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule. ...
against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule.
The GPU-based Extend kernel achieves three times higher performance and, by offloading the kernel onto an accelerator and overlapping its execution with the other functions, this results in an overall improvement to application-level execution time of up to 1.6x.
To ensure that using an accelerator always results in an overall performance improvement, especially when considering slower GPUs, an adaptive load balancing solution is introduced, which intelligently distributes work between host and GPU. This provides, compared to not using load balancing, up to +46 % more performance.
...
The GPU-based Extend kernel achieves three times higher performance and, by offloading the kernel onto an accelerator and overlapping its execution with the other functions, this results in an overall improvement to application-level execution time of up to 1.6x.
To ensure that using an accelerator always results in an overall performance improvement, especially when considering slower GPUs, an adaptive load balancing solution is introduced, which intelligently distributes work between host and GPU. This provides, compared to not using load balancing, up to +46 % more performance.
...