Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths

Journal Article (2018)

Author(s)

Ernst Joachim Houtgast (Bluebee, Rijswijk, TU Delft - Electrical Engineering, Mathematics and Computer Science)

Vlad-Mihai Sima (Bluebee, Rijswijk)

Koen Bertels (TU Delft - Electrical Engineering, Mathematics and Computer Science, TU Delft - FTQC/Bertels Lab)

Zaid Al-Ars (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Computer Engineering

Keywords

FPGA, GPU, Acceleration, BWA-MEM, Systolic array, Short read mapping

DOI related publication

https://doi.org/10.1016/j.compbiolchem.2018.03.024 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:a533e35f-18e7-4a11-af1d-c1dae5235e29

Publication Year

2018

Language

English

Volume number

75

Pages (from-to)

54-64

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We present our work on hardware-accelerated genomics pipelines, using either FPGAs or GPUs to accelerate execution of BWA-MEM, a widely used algorithm for genomic short read mapping. The mapping stage can take up to 40% of overall processing time in genomics pipelines. Our implementation offloads the Seed Extension function, one of the main computational functions of BWA-MEM, onto an accelerator. Sequencers typically output reads with a length of 150 base pairs, but read length is expected to increase in the near future. Here, we investigate the influence of read length on BWA-MEM performance using data sets with read lengths of up to 400 base pairs, and introduce methods to mitigate the impact of longer reads. For the industry-standard 150 base pair read length, our implementation achieves up to a two-fold increase in overall application-level performance on systems with at most twenty-two logical CPU cores. Longer reads require commensurately larger data structures, which directly impacts accelerator efficiency. The two-fold performance increase is sustained for read lengths of at most 250 base pairs. To improve performance, we classify the sources of inefficiency in the underlying systolic array architecture. Eliminating idle regions as far as possible improves efficiency by up to 95%. Moreover, adaptive load balancing distributes work between host and accelerator so that using an accelerator always results in a performance improvement, which in GPU-constrained scenarios provides up to 45% more performance.
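The link between read length and systolic array efficiency can be illustrated with a simple utilization model. The sketch below is not the authors' architecture. It assumes a generic linear systolic array in which each processing element (PE) handles one column of the dynamic programming matrix, and a pass over a block of columns takes `target_len + num_pes - 1` cycles because the wavefront has to fill and drain the array. The function names and parameters are hypothetical and chosen only for illustration.

```python
def systolic_cycles(query_len: int, target_len: int, num_pes: int) -> int:
    """Cycles needed to sweep a query_len x target_len matrix (assumed model).

    Each PE owns one query column per pass. A pass streams target_len rows
    through all num_pes PEs, so it costs target_len + num_pes - 1 cycles.
    """
    passes = -(-query_len // num_pes)  # ceiling division
    return passes * (target_len + num_pes - 1)


def systolic_efficiency(query_len: int, target_len: int, num_pes: int) -> float:
    """Fraction of PE-cycles that perform a useful cell update."""
    useful_updates = query_len * target_len
    available_slots = num_pes * systolic_cycles(query_len, target_len, num_pes)
    return useful_updates / available_slots


if __name__ == "__main__":
    # Compare a short and a full-length extension on an array sized for 150 columns.
    for query_len in (50, 150):
        eff = systolic_efficiency(query_len, query_len, num_pes=150)
        print(f"query={query_len:3d}  efficiency={eff:.3f}")
```

Under these assumptions, an array sized for the longest expected read spends many cycles with idle PEs whenever an extension is shorter than the array, which is the kind of idle region the abstract refers to.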

Files

Postprint_paper.pdf

(pdf | 0.84 MB)

- Embargo expired on 07-05-2020