Scaling up the GATK RNA-seq Variant Calling Pipeline with Apache Spark

Abstract

Next-generation sequencing (NGS) technology has dramatically increased the availability of RNA-seq data. Though primarily used for novel gene identification, expression quantification, and splice analysis, RNA-seq is also a cheap and efficient alternative to genome sequencing for variant calling. RNA sequencing costs less than genome sequencing, and the variants discovered from RNA-seq data are expressed, which is a desirable property for researchers who study the relationship between genotype and phenotype. In addition, variants called from RNA-seq data can be used to validate discoveries from whole-genome sequencing (WGS) or whole-exome sequencing (WES). The GATK team has adapted its Best Practices pipeline to process RNA-seq data from raw FASTQ reads to variants. However, some components of the pipeline are not optimized to process large datasets efficiently. We have studied several existing solutions that scale up the DNA-seq Best Practices pipeline, with the aim of applying the most efficient framework among them to the RNA-seq pipeline. We select Spark and implement a parallel RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Whereas the original sequential pipeline takes ~29 hours to process a 50 GB dataset with one thread, and ~16 hours with 40 threads on a node with 20 Hyper-Threading cores, our implementation takes only ~2 hours on 16 nodes, each of which has 8 CPU cores without Hyper-Threading. Our implementation is also 24.77% faster than the alternative solution while producing equally accurate results.
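
To put the reported runtimes in perspective, the short Python sketch below computes the speedups they imply. The runtimes are the approximate values quoted in the abstract; the dictionary labels and the `speedup` helper are illustrative names and not part of the pipeline. Because the Spark run uses 16 x 8 = 128 cores while the single-node runs use 20 Hyper-Threading cores, these ratios compare wall-clock time only and should not be read as per-core efficiency.

```python
# Approximate wall-clock times quoted in the abstract, in hours.
# The labels are illustrative; only the numbers come from the text.
RUNTIMES_HOURS = {
    "sequential, 1 thread": 29.0,
    "sequential, 40 threads": 16.0,
    "Spark, 16 nodes x 8 cores": 2.0,
}


def speedup(baseline_hours: float, new_hours: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


spark_hours = RUNTIMES_HOURS["Spark, 16 nodes x 8 cores"]
for label, hours in RUNTIMES_HOURS.items():
    # Ratio of each configuration's runtime to the Spark runtime.
    print(f"{label}: {hours:g} h -> {speedup(hours, spark_hours):.1f}x vs. Spark run")
```

Under these rounded figures, the Spark implementation is roughly 14.5 times faster than the single-threaded run and roughly 8 times faster than the 40-thread run.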