Streaming Distributed DNA Sequence Alignment Using Apache Spark

Conference Paper (2017)

Author(s)

H. Mushtaq (TU Delft - Computer Engineering)

N. Ahmed (TU Delft - Computer Engineering)

Z. Al-Ars (TU Delft - Computer Engineering)

Research Group

Computer Engineering

DOI related publication

https://doi.org/10.1109/BIBE.2017.00-57

DNA, Big Data, Sparks, Tools, Pipelines, Micromechanical devices

To reference this document use:

https://resolver.tudelft.nl/uuid:aa711bf4-e671-4a98-81cf-8ca5a19e207f

More Info

Publication Year

2017

Language

English

Pages (from-to)

188-193

ISBN (print)

978-1-5386-1325-2

ISBN (electronic)

978-1-5386-1324-5

Abstract

The large amount of data generated by Next-Generation Sequencing (NGS) technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The first step in analyzing such data is to map the sequenced reads to their corresponding positions in the human genome. One of the most popular tools for such sequence alignment is the Burrows-Wheeler Aligner (BWA mem). One limitation of the BWA program, though, is that it cannot be run on a cluster. In this paper, we propose StreamBWA, a new framework that allows the BWA mem program to run on a cluster in a distributed fashion while the input data is being streamed into the cluster. It can process the input data directly from a compressed file, which resides either on the local file system or at a URL. Moreover, StreamBWA can start combining the output files of the distributed BWA mem tasks while these tasks are still being executed on the cluster. Empirical evaluation shows that this streaming distributed approach is approximately 2x faster than the non-streaming approach. Furthermore, our streaming distributed approach is 5x faster than other state-of-the-art solutions such as SparkBWA. The source code of StreamBWA is publicly available at https://github.com/HamidMushtaq/StreamBWA.
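The idea of consuming a compressed input while it is still arriving can be illustrated with a minimal sketch. This is not StreamBWA's implementation. It assumes gzip-compressed FASTQ input with four lines per read, and the function name `stream_fastq_chunks` and the parameter `reads_per_chunk` are hypothetical names chosen for illustration. The function decompresses incrementally from any binary file-like object and yields fixed-size chunks of reads, which a distributed framework could then hand to separate alignment tasks.

```python
import gzip
import io
from itertools import islice
from typing import BinaryIO, Iterator, List


def stream_fastq_chunks(stream: BinaryIO, reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, decompressing the input incrementally.

    Hypothetical helper for illustration only; assumes 4 lines per read.
    """
    lines_per_chunk = 4 * reads_per_chunk
    # gzip.open accepts an existing binary file object as well as a path.
    with gzip.open(stream, mode="rt") as text:
        while True:
            # Pull only as many lines as one chunk needs.
            chunk = list(islice(text, lines_per_chunk))
            if not chunk:
                break
            yield chunk


# Small in-memory example: five reads split into chunks of two reads each.
records = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
compressed = io.BytesIO(gzip.compress(records.encode()))
chunks = list(stream_fastq_chunks(compressed, reads_per_chunk=2))
chunk_sizes = [len(c) // 4 for c in chunks]  # reads per chunk
```

Because the generator reads only what each chunk requires, downstream work can begin before the whole file has been decompressed. The same function should accept other binary file-like sources, for example the response object returned by `urllib.request.urlopen`, under the assumption that the remote file is gzip-compressed.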

Metadata only record. There are no files for this record.