E.J. Houtgast | TU Delft Repository

On Hardware-Accelerated Maximally-Efficient Systolic Arrays

Acceleration and Optimization of Genomics Pipelines Through Hardware/Software Co-Design

Doctoral thesis (2019) - Ernst Houtgast, Zaid Al-Ars, Koen Bertels

Developments in sequencing technology have drastically reduced the cost of DNA sequencing. The raw sequencing data being generated requires processing through computationally demanding suites of bioinformatics algorithms called genomics pipelines. The greatly decreased cost of sequencing has resulted in its widespread adoption, and the amount of data that is being generated is increasing exponentially, projected to soon rival big data fields such as astronomy. Therefore, acceleration and optimization of such genomics pipelines is becoming increasingly important.

The BWA-MEM genomic mapping algorithm is a critical first step of many genomics pipelines, as it maps the raw input sequences onto a reference genome, thereby reconstructing the sample's original genetic assembly. A major part of overall BWA-MEM execution time is spent performing Seed Extension, an algorithm closely related to the Smith-Waterman pairwise sequence alignment algorithm. The standard approach for the heterogeneous acceleration of the Smith-Waterman algorithm is to map it onto a systolic array architecture to compute elements of the similarity matrix in parallel. In order for systolic arrays to operate at high efficiency, they require long sequences to be aligned to one another. The BWA-MEM algorithm, in contrast, typically generates very short sequences that then require pairwise alignment through the Seed Extension algorithm. Therefore, in this dissertation, various techniques to improve the efficiency of systolic arrays for short sequence lengths are proposed.

The Variable Logical Length, the Variable Physical Length, and the Variable Logical and Physical Length systolic array architectures are proposed to eliminate the dependence of systolic array efficiency on read sequence length. To eliminate its dependence on reference sequence length, a streaming, implicit synchronizing architecture is introduced. Together, these techniques result in a maximally-efficient systolic array. A Seed Extension kernel has been implemented on both FPGA and GPU with a threefold kernel-level improvement to execution time, resulting in the first FPGA-accelerated and the first GPU-accelerated implementation of BWA-MEM with an overall end-to-end twofold application-level speedup. Moreover, a Smith-Waterman implementation has been developed on FPGA using the above efficiency improvements to the systolic array architecture, resulting in an implementation that has a performance of 214 GCUPS and that is able to attain 99.8% efficiency, which is the highest reported efficiency and performance of any FPGA-accelerated Smith-Waterman implementation to date. Finally, various aspects of these designs are evaluated, including power-efficiency and design-time. ...

Developments in sequencing technology have drastically reduced the cost of DNA sequencing. The raw sequencing data being generated requires processing through computationally demanding suites of bioinformatics algorithms called genomics pipelines. The greatly decreased cost of sequencing has resulted in its widespread adoption, and the amount of data that is being generated is increasing exponentially, projected to soon rival big data fields such as astronomy. Therefore, acceleration and optimization of such genomics pipelines is becoming increasingly important.

The BWA-MEM genomic mapping algorithm is a critical first step of many genomics pipelines, as it maps the raw input sequences onto a reference genome, thereby reconstructing the sample's original genetic assembly. A major part of overall BWA-MEM execution time is spent performing Seed Extension, an algorithm closely related to the Smith-Waterman pairwise sequence alignment algorithm. The standard approach for the heterogeneous acceleration of the Smith-Waterman algorithm is to map it onto a systolic array architecture to compute elements of the similarity matrix in parallel. In order for systolic arrays to operate at high efficiency, they require long sequences to be aligned to one another. The BWA-MEM algorithm, in contrast, typically generates very short sequences that then require pairwise alignment through the Seed Extension algorithm. Therefore, in this dissertation, various techniques to improve the efficiency of systolic arrays for short sequence lengths are proposed.

The Variable Logical Length, the Variable Physical Length, and the Variable Logical and Physical Length systolic array architectures are proposed to eliminate the dependence of systolic array efficiency on read sequence length. To eliminate its dependence on reference sequence length, a streaming, implicit synchronizing architecture is introduced. Together, these techniques result in a maximally-efficient systolic array. A Seed Extension kernel has been implemented on both FPGA and GPU with a threefold kernel-level improvement to execution time, resulting in the first FPGA-accelerated and the first GPU-accelerated implementation of BWA-MEM with an overall end-to-end twofold application-level speedup. Moreover, a Smith-Waterman implementation has been developed on FPGA using the above efficiency improvements to the systolic array architecture, resulting in an implementation that has a performance of 214 GCUPS and that is able to attain 99.8% efficiency, which is the highest reported efficiency and performance of any FPGA-accelerated Smith-Waterman implementation to date. Finally, various aspects of these designs are evaluated, including power-efficiency and design-time.

Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths

Journal article (2018) - Ernst Joachim Houtgast, Vlad-Mihai Sima, Koen Bertels, Zaid Al-Ars

We present our work on hardware accelerated genomics pipelines, using either FPGAs or GPUs to accelerate execution of BWA-MEM, a widely-used algorithm for genomic short read mapping. The mapping stage can take up to 40% of overall processing time for genomics pipelines. Our implementation offloads the Seed Extension function, one of the main BWA-MEM computational functions, onto an accelerator. Sequencers typically output reads with a length of 150 base pairs. However, read length is expected to increase in the near future. Here, we investigate the influence of read length on BWA-MEM performance using data sets with read length up to 400 base pairs, and introduce methods to ameliorate the impact of longer read length. For the industry-standard 150 base pair read length, our implementation achieves an up to two-fold increase in overall application-level performance for systems with at most twenty-two logical CPU cores. Longer read length requires commensurately bigger data structures, which directly impacts accelerator efficiency. The two-fold performance increase is sustained for read length of at most 250 base pairs. To improve performance, we perform a classification of the inefficiency of the underlying systolic array architecture. By eliminating idle regions as much as possible, efficiency is improved by up to +95%. Moreover, adaptive load balancing intelligently distributes work between host and accelerator to ensure use of an accelerator always results in performance improvement, which in GPU-constrained scenarios provides up to +45% more performance. ...

Comparative Analysis of System-Level Acceleration Techniques in Bioinformatics

A Case Study of Accelerating the Smith-Waterman Algorithm for BWA-MEM

Conference paper (2018) - Ernst Houtgast, Vlad Sima, Koen Bertels, Zaid Al-Ars

Bioinformatics workloads are characterized by huge data sets and complex algorithms, requiring enormous data processing and making high performance heterogeneous computation platforms such as FPGAs and GPUs highly relevant. We compare three accelerated implementations of the widely used BWA-MEM genomic mapping tool as a case study on design-time optimization for heterogeneous architectures: BWA-MEM-CUDA, BWA-MEM-OpenCL, and BWA-MEMVHDL, each using an optimized Smith-Waterman algorithm implementation. Optimization of design-time is important because of the significant development effort of such implementations: BWA-MEM-CUDA and BWA-MEM-OpenCL require 5-7x more lines of code to express the Smith-Waterman algorithm, while BWA-MEM-VHDL requires more than 40x as many lines of code. Similar differences hold for required implementation time, ranging from one month for BWA-MEMOpenCL to six months for BWA-MEM-VHDL. The advantages and disadvantages of each implementation are described using both quantitative and qualitative metrics, and recommendations are given for future algorithm implementations. ...

High Performance Streaming Smith-Waterman Implementation with Implicit Synchronization on Intel FPGA using OpenCL

Conference paper (2017) - Ernst Houtgast, Vlad Sima, Zaid Al-Ars

The Smith-Waterman algorithm is widely used in bioinformatics and is often used as a benchmark of FPGA performance. Here we present our highly optimized Smith- Waterman implementation on Intel FPGAs using OpenCL.
Our implementation is both faster and more efficient than other current Smith-Waterman implementations, obtaining a theoretical performance of 214 GCUPS. Moreover, due to the streaming, implicit synchronizing nature of our implementation, which streams alignments and places no restrictions on
the number of alignments in flight, it achieves 99.8% of this performance in practice, almost three times as fast as previous implementations. The expressiveness of OpenCL results in a significant reduction in lines of code, and in a significant reduction of development time compared to programming in
regular hardware description languages. ...

Power-Efficiency Analysis of Accelerated BWA-MEM Implementations on Heterogeneous Computing Platforms

Conference paper (2016) - Ernst Joachim Houtgast, Vlad-Mihai Sima, Giacomo Marchiori, Koen Bertels, Zaid Al-Ars

Next Generation Sequencing techniques have dramatically reduced the cost of sequencing genetic material, resulting in huge amounts of data being sequenced. The processing of this data poses huge challenges, both from a performance perspective, as well as from a power-efficiency perspective. Heterogeneous computing can help on both fronts, by enabling more performant and more power-efficient solutions. In this paper, power-efficiency of the BWA-MEM algorithm, a popular tool for genomic data mapping, is studied on two heterogeneous architectures. The performance and power-efficiency of an FPGA-based implementation using a single Xilinx Virtex-7 FPGA on the Alpha Data add-in card is compared to a GPU-based implementation using an NVIDIA GeForce GTX 970 and against the software-only baseline system. By offloading the Seed Extension phase on an accelerator, both implementations are able to achieve a two-fold speedup in overall application-level performance over the software-only implementation. Moreover, the highly customizable nature of the FPGA results in much higher power-efficiency, as the FPGA power consumption is less than one fourth of that of the GPU. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 44 thousand base pairs per Joule, a 2.1x gain in power-efficiency as compared to the software-only baseline. ...

Heterogeneous hardware/software acceleration of the BWA-MEM DNA alignment algorithm

Conference paper (2016) - Nauman Ahmed, VM Sima, Ernst Houtgast, Koen Bertels, Z Al-Ars

The fast decrease in cost of DNA sequencing has resulted in an enormous growth in available genome data, and hence led to an increasing demand for fast DNA analysis algorithms used for diagnostics of genetic disorders, such as cancer. One of the most computationally intensive steps in the analysis is represented by the DNA read alignment. In this paper, we present an accelerated version of BWA-MEM, one of the most popular read alignment algorithms, by implementing a heterogeneous hardware/software optimized version on the Convey HC2ex platform. A challenging factor of the BWAMEM algorithm is the fact that it consists of not one, but three computationally intensive kernels: SMEM generation, suffix array lookup and local Smith-Waterman. Obtaining substantial speedup is hence contingent on accelerating all of these three kernels at once. The paper shows an architecture containing two hardware-accelerated kernels and one kernel optimized in software. The two hardware kernels of suffix array lookup and local Smith-Waterman are able to reach speedups of 2.8x and 5.7x, respectively. The software optimization of the SMEM
generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version. ...

Computational Challenges of Next Generation Sequencing Pipelines Using Heterogeneous Systems

Abstract (2016) - Ernst Houtgast, Vlad Sima, Koen Bertels, Zaid Al-Ars

We are rapidly entering the era of genomics. The dramatic cost reduction of DNA sequencing due to the introduction of Next Generation Sequencing (NGS) techniques has resulted in an exponential growth of genetics data. The amount of data generated, and its associated processing into useful information, poses serious computational challenges. Here, we give a brief introduction of NGS, show a typical NGS processing pipeline, and show the associated challenges from a computational perspective. A case study is presented where one component of the NGS processing pipeline is accelerated: BWA-MEM, the de-facto industry-standard for the mapping stage. This is a first step in achieving a fully heterogeneously accelerated NGS pipeline. ...

Power-Efficient Accelerated Genomic Short Read Mapping on Heterogeneous Computing Platforms

Abstract (2016) - Ernst Houtgast, Vlad Sima, Giacomo Marchiori, Koen Bertels, Zaid Al-Ars

We propose a novel FPGA-accelerated BWA-MEM implementation, a popular tool for genomic data mapping. The performance and power-efficiency of the FPGA implementation on the single Xilinx Virtex-7 Alpha Data add-in card is compared
against a software-only baseline system. By offloading the Seed Extension phase onto the FPGA, a two-fold speedup in overall application-level performance is achieved and a 1.6x gain in power-efficiency. To facilitate platform and tool-agnostic comparisons, the base pairs per Joule unit is introduced as a measure of power-efficiency. The FPGA design is able to map up to 34 thousand base pairs per Joule. ...

An Efficient GPU-Accelerated Implementation of Genomic Short Read Mapping with BWA-MEM

Journal article (2016) - Ernst Houtgast, Vlad Sima, Koen Bertels, Zaid Al-Ars

Next Generation Sequencing techniques have resulted in an exponential growth in the generation of genetics data, the amount of which will soon rival, if not overtake, other Big Data elds, such as astronomy and streaming video services. To become useful, this data requires processing by a complex pipeline of algorithms, taking multiple days even on large clusters. The mapping stage of such genomics pipelines, which maps the short reads onto a reference genome, takes up a signicant portion of execution time. BWA-MEM is the de-facto industry-standard for the mapping stage. Here, a GPU-accelerated implementation of BWA-MEM is proposed. The Seed Extension phase, one of the three main BWA-MEM algorithm phases that requires between 30%-50% of overall processing time, is ooaded onto the GPU. A thorough design space analysis is presented for an optimized mapping of this phase onto the GPU. The re-sulting systolic-array based implementation obtains a two-fold overall application-level speedup, which is the maximum theoretically achievable speedup. Moreover, this speedup is sustained for systems with up to twenty-two logical cores. Based on the ndings, a number of suggestions are made to improve GPU architecture, resulting in potentially greatly increased performance for bioinformatics-class algorithms. ...

GPU-Accelerated BWA-MEM Genomic Mapping Algorithm Using Adaptive Load Balancing

Conference paper (2016) - Ernst Houtgast, Vlad Sima, Koen Bertels, Zaid Al-Ars

Genomic sequencing is rapidly becoming a premier generator of Big Data, posing great computational challenges. Hence, acceleration of the algorithms used is of utmost importance. This paper presents a GPU-accelerated implementation of BWA-MEM, a widely used algorithm to map genomic sequences onto a reference genome. BWA-MEM contains three main computational functions: Seed Generation, Seed Extension and Output Generation. This paper discusses acceleration of the Seed Extension function on a GPU accelerator.
The GPU-based Extend kernel achieves three times higher performance and, by offloading the kernel onto an accelerator and overlapping its execution with the other functions, this results in an overall improvement to application-level execution time of up to 1.6x.
To ensure that using an accelerator always results in an overall performance improvement, especially when considering slower GPUs, an adaptive load balancing solution is introduced, which intelligently distributes work between host and GPU. This provides, compared to not using load balancing, up to +46 % more performance.
...

An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm

Conference paper (2015) - Ernst Houtgast, VM Sima, K Bertels, Z Al-Ars

We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel requires a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to variable length Smith-Waterman execution. Our kernel implementation is up to 3x faster, compared to software-only execution. This translates into an overall application speedup of up to 45%, which is 96% of the theoretically maximum achievable speedup when accelerating only this kernel.
...