Scaling up the GATK RNA-seq Variant Calling Pipeline with Apache Spark

Master Thesis (2018)
Author(s)

Saiyi Wang (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Zaid Al-Ars – Mentor

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2018
Language
English
Graduation Date
30-08-2018
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering, Embedded Systems
Collections
thesis
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Next-generation sequencing (NGS) technology has dramatically increased the availability of RNA-seq data. Though primarily used for novel gene identification, expression quantification, and splice analysis, RNA-seq is also a cheap and efficient alternative to genome sequencing for variant calling. RNA sequencing costs less than genome sequencing, and the variants discovered from RNA-seq data are expressed, which is a desirable feature for researchers who study the relation between genotype and phenotype. In addition, variants called in RNA-seq data can be used to validate discoveries from whole-genome sequencing (WGS) or whole-exome sequencing (WES). The GATK team has adapted its Best Practices pipeline to process RNA-seq data from raw FASTQ reads to variants. However, some components of the pipeline are not optimized to process large datasets efficiently. We study several existing solutions that scale up the DNA-seq Best Practices pipeline, with the aim of applying the most efficient framework among them to the RNA-seq pipeline. We select Spark and implement a parallel RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Whereas the original sequential pipeline takes ~29 hours to process a 50 GB dataset with one thread, and ~16 hours with 40 threads on a node with 20 Hyper-Threading cores, our implementation takes only ~2 hours on 16 nodes, each with 8 CPU cores without Hyper-Threading. Our implementation is also 24.77% faster than the alternative solution while producing equally accurate results.
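
The runtimes quoted above can be turned into rough speedup factors. The following is a minimal sketch of that arithmetic only, not code from the thesis. It assumes the approximate figures in the abstract (~29 h, ~16 h, ~2 h) are taken at face value; the dictionary keys and the `speedup` helper are hypothetical names introduced for illustration.

```python
# Approximate runtimes quoted in the abstract, in hours (rounded figures).
runtimes_h = {
    "sequential, 1 thread": 29.0,
    "sequential, 40 threads": 16.0,
    "Spark, 16 nodes x 8 cores": 2.0,
}


def speedup(baseline_h: float, improved_h: float) -> float:
    """Return how many times faster the improved run is than the baseline."""
    return baseline_h / improved_h


spark_key = "Spark, 16 nodes x 8 cores"
spark_h = runtimes_h[spark_key]

# Speedup of the Spark run relative to each sequential configuration.
speedups = {
    name: speedup(hours, spark_h)
    for name, hours in runtimes_h.items()
    if name != spark_key
}

for name, factor in speedups.items():
    print(f"Spark vs {name}: ~{factor:.1f}x")
```

Because the inputs are rounded, the resulting factors (roughly 14.5x over one thread and 8x over 40 threads) are only indicative.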
