N. Ahmed | TU Delft Repository

Comparison of cloud-to-cloud distance calculation methods for change detection in spatio-temporal point clouds

Abstract (2024) - Vitali Diaz, Peter van Oosterom, Martijn Meijers, Edward Verbree, Nauman Ahmed, Thijs van Lankveld

The advantages of using point clouds for change detection analysis include comprehensive spatial and temporal representation, as well as high precision and accuracy in the calculations. These benefits make point clouds a powerful data type for spatio-temporal analysis. Nevertheless, most current change detection methods have been specifically designed and utilized for raster data. This research aims to identify the most suitable cloud-to-cloud (c2c) distance calculation algorithm for further implementation in change detection for spatio-temporal point clouds. Eight different methods, varying in complexity and execution time, are compared without converting the point cloud data into rasters. Hourly point cloud data from monitoring a beach-dune system's dynamics is used to carry out the comparison. The c2c distance methods are (1) the nearest neighbor, (2) least squares plane, (3) linear interpolation, (4) quadratic (height function), (5) 2.5D triangulation, (6) natural neighbor interpolation (NNI), (7) inverse distance weight (IDW) and (8) multiscale model to model cloud comparison (M3C2). We evaluate these algorithms, considering both the accuracy of the calculated distance and the execution time. The results can be valuable for analyzing and monitoring the (build) environment with spatio-temporal point cloud data. ...

Comparison of Cloud-to-Cloud Distance Calculation Methods

Is the Most Complex Always the Most Suitable?

Conference paper (2024) - Vitali Diaz, Peter van Oosterom, B.M. Meijers, Edward Verbree, Nauman Ahmed, Thijs van Lankveld

Cloud-to-cloud (C2C) distance calculations are frequently performed as an initial stage in change detection and spatiotemporal analysis with point clouds. There are various methods for calculating C2C distance, also called inter-point distance, which refers to the distance between two corresponding point clouds captured at different epochs. These methods can be classified from simple to complex, with more steps and calculations required for the latter. Generally, it is assumed that a more complex method will result in a more precise calculation of inter-point distance, but this assumption is rarely evaluated. This paper compares eight commonly used methods for calculating the inter-point distance. The results indicate that the accuracy of distance calculations depends on the chosen method and a characteristic related to the point density, the intra-point distance, which refers to the distance between points within the same point cloud. The results are helpful for applications that analyze spatiotemporal point clouds for change detection. The findings will be helpful in future applications, including analyzing spatiotemporal point clouds for change detection. ...

SALoBa

Maximizing Data Locality and Workload Balance for Fast Sequence Alignment on GPUs

Conference paper (2022) - Seongyeon Park, Hajin Kim, Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars, Peter Hofstee, Youngsok Kim, Jinho Lee

Sequence alignment forms an important backbone in many sequencing applications. A commonly used strategy for sequence alignment is an approximate string matching with a two-dimensional dynamic programming approach. Although some prior work has been conducted on GPU acceleration of a sequence alignment, we identify several shortcomings that limit exploiting the full computational capability of modern GPUs. This paper presents SALoBa, a GPU-accelerated sequence alignment library focused on seed extension. Based on the analysis of previous work with real-world sequencing data, we propose techniques to exploit the data locality and improve work-load balancing. The experimental results reveal that SALoBa significantly improves the seed extension kernel compared to state-of-the-art GPU-based methods. ...

Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework

Journal article (2020) - Tanveer Ahmad, Nauman Ahmed, Zaid Al-Ars, H. Peter Hofstee

Background: Immense improvements in sequencing technologies enable producing large amounts of high throughput and cost effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need this large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, to process this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM memory, it was not feasible to place large amounts of working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data in memory to process it directly from memory to avoid disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, proper formatted data placement in memory and its high throughput access is necessary by avoiding (de)-serialization and copy overheads in between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in-memory without having to access disk storage and avoiding (de)-serialization and copy overheads. Implementation: We integrate Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library in widely used genomics high throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects. Results: Our implementation shows that adopting in-memory SAM representation in genomics high throughput data processing applications results in better system resource utilization, low number of memory accesses due to high cache locality exploitation and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques like ramDisk and Unix pipes to show how columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, a speedup of 1.45x and 1.27x for these data sets, respectively, is achieved, as compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration the speedup is even more promising. Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM. ...

Background: Immense improvements in sequencing technologies enable producing large amounts of high throughput and cost effective next-generation sequencing (NGS) data. This data needs to be processed efficiently for further downstream analyses. Computing systems need this large amounts of data closer to the processor (with low latency) for fast and efficient processing. However, existing workflows depend heavily on disk storage and access, to process this data incurs huge disk I/O overheads. Previously, due to the cost, volatility and other physical constraints of DRAM memory, it was not feasible to place large amounts of working data sets in memory. However, recent developments in storage-class memory and non-volatile memory technologies have enabled computing systems to place huge data in memory to process it directly from memory to avoid disk I/O bottlenecks. To exploit the benefits of such memory systems efficiently, proper formatted data placement in memory and its high throughput access is necessary by avoiding (de)-serialization and copy overheads in between processes. For this purpose, we use the newly developed Apache Arrow, a cross-language development framework that provides language-independent columnar in-memory data format for efficient in-memory big data analytics. This allows genomics applications developed in different programming languages to communicate in-memory without having to access disk storage and avoiding (de)-serialization and copy overheads. Implementation: We integrate Apache Arrow in-memory based Sequence Alignment/Map (SAM) format and its shared memory objects store library in widely used genomics high throughput data processing applications like BWA-MEM, Picard and GATK to allow in-memory communication between these applications. In addition, this also allows us to exploit the cache locality of tabular data and parallel processing capabilities through shared memory objects. Results: Our implementation shows that adopting in-memory SAM representation in genomics high throughput data processing applications results in better system resource utilization, low number of memory accesses due to high cache locality exploitation and parallel scalability due to shared memory objects. Our implementation focuses on the GATK best practices recommended workflows for germline analysis on whole genome sequencing (WGS) and whole exome sequencing (WES) data sets. We compare a number of existing in-memory data placing and sharing techniques like ramDisk and Unix pipes to show how columnar in-memory data representation outperforms both. We achieve a speedup of 4.85x and 4.76x for WGS and WES data, respectively, in overall execution time of variant calling workflows. Similarly, a speedup of 1.45x and 1.27x for these data sets, respectively, is achieved, as compared to the second fastest workflow. In some individual tools, particularly in sorting, duplicates removal and base quality score recalibration the speedup is even more promising. Availability: The code and scripts used in our experiments are available in both container and repository form at: https://github.com/abs-tudelft/ArrowSAM.

GPU acceleration of Darwin read overlapper for de novo assembly of long DNA reads

Journal article (2020) - Nauma Ahmed, Tong Dong Qiu, Koen Bertels, Zaid Al-Ars

BACKGROUND: In Overlap-Layout-Consensus (OLC) based de novo assembly, all reads must be compared with every other read to find overlaps. This makes the process rather slow and limits the practicality of using de novo assembly methods at a large scale in the field. Darwin is a fast and accurate read overlapper that can be used for de novo assembly of state-of-the-art third generation long DNA reads. Darwin is designed to be hardware-friendly and can be accelerated on specialized computer system hardware to achieve higher performance. RESULTS: This work accelerates Darwin on GPUs. Using real Pacbio data, our GPU implementation on Tesla K40 has shown a speedup of 109x vs 8 CPU threads of an Intel Xeon machine and 24x vs 64 threads of IBM Power8 machine. The GPU implementation supports both linear and affine gap, scoring model. The results show that the GPU implementation can achieve the same high speedup for different scoring schemes. CONCLUSIONS: The GPU implementation proposed in this work shows significant improvement in performance compared to the CPU version, thereby making it accessible for utilization as a practical read overlapper in a DNA assembly pipeline. Furthermore, our GPU acceleration can also be used for performing fast Smith-Waterman alignment between long DNA reads. GPU hardware has become commonly available in the field today, making the proposed acceleration accessible to a larger public. The implementation is available at https://github.com/Tongdongq/darwin-gpu . ...

ArrowSAM

In-Memory Genomics Data Processing Using Apache Arrow

Conference paper (2020) - Tanveer Ahmad, Nauman Ahmed, Johan Peltenburg, Zaid Al-Ars

The rapidly growing size of genomics data bases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large amounts of working data sets in memory. However, new emerging storage class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be presented in the Apache Arrow in-memory data representation to benefit of in-memory processing and to ensure better scalability through shared memory objects, by avoiding large (de)-serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups as compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM. ...

High Performance Seed-and-Extend Algorithms for Genomics

Doctoral thesis (2020) - Nauman Ahmed

Recent advances in DNA sequencing technology have opened new doors for scientists to use genomic data analysis in a variety of applications that directly affect human lives. However, the analysis of unprecedented volumes of sequencing data being produced represents a formidable computational challenge. The conventional CPU-only computing paradigm is not sufficient to analyze exponentially growing sequencing data in a cost-effective and timely manner. Heterogeneous computing systems with GPU and FPGA based accelerators have become easily accessible and are increasingly being used to process massive amounts of data due to their better performance-to-cost ratio than CPU-only platforms. Furthermore, highly optimized analysis algorithms are required to extract the maximum computational power of these computing systems. ...

Efficient GPU Acceleration for Computing Maximal Exact Matches in Long DNA Reads

Conference paper (2020) - Nauman Ahmed, Koen Bertels, Zaid Al-Ars

The seeding heuristic is widely used in many DNA analysis applications to speed up the analysis time. In many applications, seeding takes a substantial amount of the total execution time. In this paper, we present an efficient GPU implementation for computing maximal exact matching (MEM) seeds in long DNA reads. We applied various optimizations to reduce the number of GPU global memory accesses and to avoid redundant computation. Our implementation also extracts maximum parallelism from the MEM computation tasks. We tested our implementation using data from the state-of-the-art third generation Pacbio DNA sequencers, which produces DNA reads that are tens of kilobases long. Our implementation is up to 9x faster for computing MEM seeds as compared to the fastest CPU implementation running on a server-grade machine with 24 threads. Computing suffix array intervals (first part of MEM computation) is up to 3x faster whereas calculating the location of the match (second part) is up to 9x faster. The implementation is publicly available at https://github.com/nahmedraja/GPUseed. ...

Correction to

GASAL2: A GPU accelerated sequence alignment library for high-Throughput NGS data (BMC Bioinformatics (2019) 20 (520) DOI: 10.1186/s12859-019-3086-9)

Journal article (2019) - Nauman Ahmed, Jonathan Lévy, Shanshan Ren, Hamid Mushtaq, Koen Bertels, Zaid Al-Ars

Following publication of the original article [1], the author requested changes to the figures 4, 7, 8, 9, 12 and 14 to align these with the text. The corrected figures are supplied below. The original article [1] has been corrected. [Typesetter, please insert new supplied figure in package]. ...

GASAL2

A GPU accelerated sequence alignment library for high-throughput NGS data

Journal article (2019) - Nauman Ahmed, Jonathan Lévy, Shanshan Ren, Hamid Mushtaq, Koen Bertels, Zaid Al-Ars

BACKGROUND: Due the computational complexity of sequence alignment algorithms, various accelerated solutions have been proposed to speedup this analysis. NVBIO is the only available GPU library that accelerates sequence alignment of high-throughput NGS data, but has limited performance. In this article we present GASAL2, a GPU library for aligning DNA and RNA sequences that outperforms existing CPU and GPU libraries. RESULTS: The GASAL2 library provides specialized, accelerated kernels for local, global and all types of semi-global alignment. Pairwise sequence alignment can be performed with and without traceback. GASAL2 outperforms the fastest CPU-optimized SIMD implementations such as SeqAn and Parasail, as well as NVIDIA's own GPU-based library known as NVBIO. GASAL2 is unique in performing sequence packing on GPU, which is up to 750x faster than NVBIO. Overall on Geforce GTX 1080 Ti GPU, GASAL2 is up to 21x faster than Parasail on a dual socket hyper-threaded Intel Xeon system with 28 cores and up to 13x faster than NVBIO with a query length of up to 300 bases and 100 bases, respectively. GASAL2 alignment functions are asynchronous/non-blocking and allow full overlap of CPU and GPU execution. The paper shows how to use GASAL2 to accelerate BWA-MEM, speeding up the local alignment by 20x, which gives an overall application speedup of 1.3x vs. CPU with up to 12 threads. CONCLUSIONS: The library provides high performance APIs for local, global and semi-global alignment that can be easily integrated into various bioinformatics tools. ...

Diminished-1 Fermat Number Transform for Integer Convolutional Neural Networks

Conference paper (2019) - Zhu Baozhou, Nauman Ahmed, Johan Peltenburg, Koen Bertels, Zaid Al-Ars

Convolutional Neural Networks (CNNs) are a class of widely used deep artificial neural networks. However, training large CNNs to produce state-of-the-art results can take a long time. In addition, we need to reduce compute time of the inference stage for trained networks to make it accessible for real time applications. In order to achieve this, integer number formats INT8 and INT16 with reduced precision are being used to create Integer Convolutional Neural Networks (ICNNs) to allow them to be deployed on mobile devices or embedded systems. In this paper, Diminished-l Fermat Number Transform (DFNT), which refers to Fermat Number Transform (FNT) with diminished-l number representation, is proposed to accelerate ICNNs through algebraic properties of integer convolution. This is achieved by performing the convolution step as diminished -1 point-wise products between DFNT transformed feature maps, which can be reused multiple times in the calculation. Since representing and computing all the integers in the ring of integers modulo Fermat number 2 {b}+1 for FNT requires b+1 bits, diminished-1 number representation is used to enable exact and efficient calculation. Using DFNT, integer convolution is implemented on a general purpose processor, showing speedup of 2-3x with typical parameter configurations and better scalability without any round-off error compared to the baseline. ...

GPU Accelerated Sequence Alignment with Trace-back for GATK HaplotypeCaller

Journal article (2019) - Shanshan Ren, Nauman Ahmed, Koen Bertels, Zaid Al-Ars

Background: Pairwise sequence alignment is widely used in many biological tools and applications. Existing GPU accelerated implementations mainly focus on calculating optimal alignment score and omit identifying the optimal alignment itself. In GATK HaplotypeCaller (HC), the semi-global pairwise sequence alignment with traceback has so far
been difficult to accelerate effectively on GPUs.
Results: We first analyze the characteristics of the semi-global alignment with traceback in GATK HC and then propose a new algorithm that allows for retrieving the optimal alignment efficiently on GPUs. For the first stage, we choose intra-task parallelization model to calculate the position of the optimal alignment score and the backtracking matrix. Moreover, in the first stage, our GPU implementation also records the length of consecutive matches/mismatches in
addition to lengths of consecutive insertions and deletions as in the CPU-based implementation. This helps efficiently
retrieve the backtracking matrix to obtain the optimal alignment in the second stage.
Conclusions: Experimental results show that our alignment kernel with traceback is up to 80x and 14.14x faster than its CPU counterpart with synthetic datasets and real datasets, respectively. When integrated into GATK HC (alongside a GPU accelerated pair-HMMs forward kernel), the overall acceleration is 2.3x faster than the baseline GATK HC
implementation, and 1.34x faster than the GATK HC implementation with the integrated GPU-based pair-HMMs forward algorithm. Although the methods proposed in this paper is to improve the performance of GATK HC, they can also be used in other pairwise alignments and applications. ...

SparkGA2

Production-quality memory-efficient Apache Spark based genome analysis framework

Journal article (2019) - Hamid Mushtaq, Nauman Ahmed, Zaid Al-Ars

Due to the rapid decrease in the cost of NGS (Next Generation Sequencing), interest has increased in using data generated from NGS to diagnose genetic diseases. However, the data generated by NGS technology is usually in the order of hundreds of gigabytes per experiment, thus requiring efficient and scalable programs to perform data analysis quickly. This paper presents SparkGA2, a memory efficient, production quality framework for high performance DNA analysis in the cloud, which can scale according to the available computational resources by increasing the number of nodes. Our framework uses Apache Spark's ability to cache data in the memory to speed up processing, while also allowing the user to run the framework on systems with lower amounts of memory at the cost of slightly less performance. To manage the memory footprint, we implement an on-the-fly compression method of intermediate data and reduce memory requirements by up to 3x. Our framework also uses a streaming approach to gradually stream input data as processing is taking place. This makes our framework faster than other state of the art approaches while at the same time allowing users to adapt it to run on clusters with lower memory. As compared to the state of the art, SparkGA2 is up to 22% faster on a large big data cluster of 67 nodes and up to 9% faster on a smaller cluster of 6 nodes. Including the streaming solution, where data pre-processing is considered, SparkGA2 is 51% faster on a 6 node cluster. The source code of SparkGA2 is publicly available at https://github.com/HamidMushtaq/SparkGA2. ...

An Efficient GPU-based de Bruijn Graph Construction Algorithm for Micro-Assembly

Conference paper (2018) - Shanshan Ren, Nauman Ahmed, Koen Bertels, Zaid Al-Ars

In order to improve the accuracy of indel detection, micro-assembly is used in multiple variant callers, such as the GATK HaplotypeCaller to reassemble reads in a specific region of the genome. Assembly is a computationally intensive process that causes runtime bottlenecks. In this paper, we propose a GPU-based de Bruijn graph construction algorithm for micro-assembly in the GATK HaplotypeCaller to improve its performance. Various synthetic datasets are used to compare the performance of the GPU-based de Bruijn graph construction implementation with the software-only baseline, which achieves a speedup of up to 3x. An experiment using two human genome datasets is used to evaluate the performance shows a speedup of up to 2.66x. ...

GPU Accelerated API for Alignment of Genomics Sequencing Data

Conference paper (2017) - Nauman Ahmed, Hamid Mushtaq, Koen Bertels, Zaid Al-Ars

Sequence alignment is a core step in the processing of DNA and RNA sequencing data. In this paper, we present a high performance GPU accelerated library (GASAL) for pairwise sequence alignment of DNA and RNA sequences. The GASAL library provides accelerated kernels for local, global as
well as semi-global alignment, allowing the computation of the alignment score, and optionally the start and end positions of the alignment. GASAL outperforms the fastest CPU-optimized SIMD implementations such as SSW and Parasail. It also outperforms NVBIO, NVIDIA’s own CUDA library for sequence analysis of high-throughput sequencing data. GASAL uses the unique approach of also performing the sequence packing on GPU, which is over 200x faster than the NVBIO approach. Overall on Tesla K40c GASAL is 10-14x faster than 28 Intel Xeon cores and 3-4x faster than NVBIO with a query length of 100 bases. The library provides easy to use APIs to allow integration into various bioinformatics tools. ...

Streaming Distributed DNA Sequence Alignment Using Apache Spark

Conference paper (2017) - Hamid Mushtaq, Nauman Ahmed, Zaid Al-Ars

The large amount of data generated by NextGeneration Sequencing (NGS) technology, usually in the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The first step
in analyzing such data is to map those sequenced reads to their corresponding positions in the human genome. One of the most popular tools to do such sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program though is that it cannot be run on a cluster.
In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion, at the same time while the input data is being streamed into the cluster. It can process the input
data directly from a compressed file, which either lies on the local file system or on a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks at the same time while these tasks are still being executed on the cluster. Empirical evaluation shows that this streaming
distributed approach is approximately 2x faster than the nonstreaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at https://github.com/HamidMushtaq/StreamBWA. ...

Predictive Genome Analysis Using Partial DNA Sequencing Data

Conference paper (2017) - Nauman Ahmed, Koen Bertels, Zaid Al-Ars

Much research has been dedicated to reducing the computational time associated with the analysis of genome data, which resulted in shifting the bottleneck from the time needed for the computational analysis part to the actual time needed for sequencing of DNA information. DNA sequencing is a time consuming process, and all existing DNA analysis methods have to wait for the DNA sequencing to completely finish before starting the analysis. In this paper, we propose a new DNA analysis approach where we start the genome analysis before the DNA sequencing is completely finished. The genome analysis is started when the DNA reads are still in the process of being sequenced. We use algorithms to predict the unknown bases and their corresponding base quality scores of the incomplete read. Results show that our method of predicting the unknown bases and quality scores achieves more than 90% similarity with the full dataset for 50 unknown bases (slashing more than a day of sequencing time). We also show that our base quality value prediction scheme is highly accurate, only reducing the similarity of the detected variants by 0.45%. However, there is still room to introduce more accurate prediction schemes for the unknown bases to increase the effectiveness of the analysis by up to 5.8%. ...

A Comparison of Seed-and-Extend Techniques in Modern DNA Read Alignment Algorithms

Conference paper (2016) - Nauman Ahmed, Koen Bertels, Zaid Al-Ars

DNA read alignment is a major step in genome analysis. However, as DNA reads continue to become longer, new approaches need to be developed to effectively use these longer reads in the alignment process. Modern aligners commonly use a two-step approach for read alignment: 1. seeding, 2. extension. In this paper, we have investigated various seeding and extension techniques used in modern DNA read alignment algorithms to find the best seeding and extension combinations. We developed an open source generic DNA read aligner that can be used to compare the alignment accuracy and total execution time of different combinations of seeding and extension algorithms. For extension, our results show that local alignment is the best extension approach, achieving up to 3.6x more accuracy than other extension techniques, for longer reads. For seeding, if BLAST-like seed extension is used, the best seeding approach is identifying all SMEMs in the DNA read (e.g., approach used by BWA-MEM). This combination is up to 6x more accurate than other seeding techniques, for longer reads. With local alignment, we observed that the seeding technique does not impact the alignment accuracy. Furthermore, we showed that an optimized implementation of local alignment using vector instructions, enabling 4.5x speedup, makes it the fastest of all extension techniques. Overall, we show that using local alignment with non-overlapping maximal exact matching seeds is the best seeding-extension combination due to its high accuracy and higher potential for optimization/acceleration for future DNA reads. ...

Heterogeneous hardware/software acceleration of the BWA-MEM DNA alignment algorithm

Conference paper (2016) - Nauman Ahmed, VM Sima, Ernst Houtgast, Koen Bertels, Z Al-Ars

The fast decrease in cost of DNA sequencing has resulted in an enormous growth in available genome data, and hence led to an increasing demand for fast DNA analysis algorithms used for diagnostics of genetic disorders, such as cancer. One of the most computationally intensive steps in the analysis is represented by the DNA read alignment. In this paper, we present an accelerated version of BWA-MEM, one of the most popular read alignment algorithms, by implementing a heterogeneous hardware/software optimized version on the Convey HC2ex platform. A challenging factor of the BWAMEM algorithm is the fact that it consists of not one, but three computationally intensive kernels: SMEM generation, suffix array lookup and local Smith-Waterman. Obtaining substantial speedup is hence contingent on accelerating all of these three kernels at once. The paper shows an architecture containing two hardware-accelerated kernels and one kernel optimized in software. The two hardware kernels of suffix array lookup and local Smith-Waterman are able to reach speedups of 2.8x and 5.7x, respectively. The software optimization of the SMEM
generation kernel is able to achieve a speedup of 1.7x. This enables a total application acceleration of 2.6x compared to the original software version. ...