Scaling up the GATK RNA-seq Variant Calling Pipeline with Apache Spark

Wang, S.

Master thesis (2018)

Authors

S. Wang (Electrical Engineering, Mathematics and Computer Science)

Contributors

Zaid Al-Ars (mentor)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:871a645a-81e4-4686-abc0-53944366c6a4

More Info

Published Date

30-08-2018

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Next-generation sequencing (NGS) technology has dramatically increased the availability of RNA-seq data. Though primarily used for novel gene identification, expression quantification, and splice analysis, RNA-seq is also a cheap and efficient alternative to genome sequencing for variant calling. RNA sequencing costs less than genome sequencing. In addition, the variants discovered from RNA-seq data are expressed, which is a desirable feature for researchers who want to study the relation between genotype and phenotype. Furthermore, variants called in RNA-seq data can be used to validate discoveries from whole-genome sequencing (WGS) or whole-exome sequencing (WES). The GATK team has adapted the Best Practices pipeline to process RNA-seq data from raw FASTQ reads to variants. However, some components of the pipeline are not optimized to process large datasets efficiently. We have studied several scalable solutions for the DNA-seq Best Practices pipeline, with the aim of applying the most efficient framework among them to the RNA-seq pipeline. We select Spark and implement a parallel RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Whereas the original sequential pipeline takes ~29 hours to process a 50 GB dataset with one thread, and ~16 hours with 40 threads on a node with 20 Hyper-Threading cores, our implementation takes only ~2 hours on 16 nodes, each of which has 8 CPU cores without Hyper-Threading. Our implementation is also 24.77% faster than the alternative solution while producing equally accurate results.
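The runtimes quoted in the abstract can be turned into speedup factors with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and the inputs are the approximate figures from the abstract (~29 h, ~16 h, ~2 h), so the resulting ratios are rough. The runs also used different hardware, so the ratios are not a like-for-like efficiency measure.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    if new_hours <= 0:
        raise ValueError("new_hours must be positive")
    return baseline_hours / new_hours


# Approximate runtimes quoted in the abstract (hours, 50 GB dataset).
SEQUENTIAL_1_THREAD_HOURS = 29.0    # original pipeline, one thread
SEQUENTIAL_40_THREADS_HOURS = 16.0  # original pipeline, 40 threads on one node
SPARK_16_NODES_HOURS = 2.0          # Spark implementation, 16 nodes

# Cluster size quoted in the abstract: 16 nodes with 8 CPU cores each.
SPARK_TOTAL_CORES = 16 * 8

if __name__ == "__main__":
    vs_single = speedup(SEQUENTIAL_1_THREAD_HOURS, SPARK_16_NODES_HOURS)
    vs_multi = speedup(SEQUENTIAL_40_THREADS_HOURS, SPARK_16_NODES_HOURS)
    print(f"Speedup over 1-thread run:  {vs_single:.1f}x")
    print(f"Speedup over 40-thread run: {vs_multi:.1f}x")
    print(f"Total Spark cores used:     {SPARK_TOTAL_CORES}")
```

With these rounded inputs, the Spark run comes out roughly 14.5 times faster than the single-threaded run and roughly 8 times faster than the 40-thread run, using 128 cores in total.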

Files

FinalVersion_RNA_Spark_MSc_swa... (pdf)

(pdf | 3.35 Mb)

License info not available