SparkGA

A Spark Framework for Cost Effective, Fast and Accurate DNA Analysis at Scale

Conference Paper (2017)

Author(s)

H Mushtaq (TU Delft - Computer Engineering)

Frank Liu (IBM Research)

Carlos Costa (IBM Yorktown)

Gang Liu (IBM Research)

H.P. Hofstee (IBM Research)

Z. Al-Ars (TU Delft - Computer Engineering)

Research Group

Computer Engineering

DOI related publication

https://doi.org/10.1145/3107411.3107438

To reference this document use:

https://resolver.tudelft.nl/uuid:af953174-57c4-44f1-adbf-b77bd316bd9b

More Info

Publication Year

2017

Language

English

Pages (from-to)

148-157

ISBN (print)

978-1-4503-4722-8

Abstract

In recent years, the cost of NGS (Next Generation Sequencing) technology has dropped dramatically, making it a viable method for diagnosing genetic diseases. The large amount of data generated by NGS technology, usually on the order of hundreds of gigabytes per experiment, has to be analyzed quickly to generate meaningful variant results. The GATK best practices pipeline from the Broad Institute is one of the most popular computational pipelines for DNA analysis. Many components of the GATK pipeline, however, are not very parallelizable. In this paper, we present SparkGA, a parallel implementation of a DNA analysis pipeline based on the Apache Spark big data framework. This implementation is highly scalable and parallelizes computation by exploiting data-level parallelism as well as load balancing techniques. To reduce the cost of analysis, SparkGA can run on nodes with as little as 16 GB of memory. For whole genome sequencing experiments, we show that the runtime can be reduced to about 1.5 hours on a 20-node cluster with an accuracy of up to 99.9981%. Moreover, SparkGA is about 71% faster than other state-of-the-art solutions while also being more accurate. The source code of SparkGA is publicly available at https://github.com/HamidMushtaq/SparkGA1.git.
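The abstract mentions load balancing without describing how it is done. As a purely illustrative sketch, and not SparkGA's actual algorithm, the following Python snippet shows one generic way to balance unevenly sized genomic regions across a fixed number of parallel tasks: a greedy "largest first" assignment. The function name, the region names, and the read counts are all hypothetical and chosen only for illustration.

```python
import heapq


def balance_regions(region_sizes, num_tasks):
    """Greedy largest-first assignment of regions to tasks (illustrative only).

    region_sizes: dict mapping a hypothetical region name to its read count.
    Returns a list of (total_load, [region names]) tuples, one per task.
    """
    # Min-heap of (current load, task index): the lightest task is always on top.
    heap = [(0, i) for i in range(num_tasks)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_tasks)]
    loads = [0] * num_tasks

    # Place the largest regions first; each goes to the currently lightest task.
    by_size = sorted(region_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in by_size:
        load, idx = heapq.heappop(heap)
        assignments[idx].append(name)
        loads[idx] = load + size
        heapq.heappush(heap, (loads[idx], idx))

    return list(zip(loads, assignments))


# Hypothetical read counts per region (made-up numbers, not from the paper).
regions = {"chr1": 90, "chr2": 70, "chr3": 40, "chr4": 30, "chr5": 20}
for load, names in balance_regions(regions, num_tasks=2):
    print(load, names)
```

In a Spark setting, a grouping like this could in principle be used to decide which regions share a partition, so that no single task dominates the runtime; whether SparkGA does anything similar is not stated in this record.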

No files available

Metadata only record. There are no files for this record.