Streaming Distributed DNA Sequence Alignment Using Apache Spark

More Info
expand_more

Abstract

The large amount of data generated by NextGeneration Sequencing (NGS) technology, usually in the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The first step
in analyzing such data is to map those sequenced reads to their corresponding positions in the human genome. One of the most popular tools to do such sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program though is that it cannot be run on a cluster.
In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion, at the same time while the input data is being streamed into the cluster. It can process the input
data directly from a compressed file, which either lies on the local file system or on a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks at the same time while these tasks are still being executed on the cluster. Empirical evaluation shows that this streaming
distributed approach is approximately 2x faster than the nonstreaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at https://github.com/HamidMushtaq/StreamBWA.

Files