ArrowSAM

In-Memory Genomics Data Processing Using Apache Arrow

Conference Paper (2020)

Author(s)

Tanveer Ahmad (TU Delft - Computer Engineering)

N. Ahmed (TU Delft - Quantum & Computer Engineering, TU Delft - Numerical Analysis)

Johan Peltenburg (TU Delft - Computer Engineering)

Z. Al Ars (TU Delft - Computer Engineering)

Research Group

Computer Engineering

DOI related publication

https://doi.org/10.1109/ICCAIS48893.2020.9096725

Keywords

Big Data, Parallel Processing, Genomics, Apache Arrow, Whole Genome/Exome Sequencing, In-Memory

To reference this document use:

https://resolver.tudelft.nl/uuid:31a08eed-9416-4787-ac30-9f41f45695fd

More Info

Publication Year

2020

Language

English

Pages (from-to)

1-6

ISBN (print)

978-1-7281-4214-2

ISBN (electronic)

978-1-7281-4213-5

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data close to the processor for fast processing. Traditionally, due to the cost, volatility and other physical constraints of DRAM, it was not feasible to keep large working data sets in memory. However, emerging storage-class memories allow big data to be stored and processed closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to ensure better scalability through shared memory objects, avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.
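The idea of holding SAM records in a columnar in-memory layout can be illustrated with a minimal sketch. It uses only the Python standard library and does not call the Apache Arrow API or reproduce the ArrowSAM implementation; the function names `sam_to_columns` and `coordinate_sort_order` and the three example reads are hypothetical. The only detail taken from outside this page is the list of eleven mandatory alignment fields defined by the SAM specification. The sketch stores one array per field, so an operation such as coordinate sorting reads only the columns it needs.

```python
from array import array

# The eleven mandatory SAM alignment fields, in file order (SAM specification).
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def sam_to_columns(lines):
    """Hypothetical helper: one column per mandatory field instead of one row per read."""
    columns = {name: (array("q") if name in INT_FIELDS else [])
               for name in SAM_FIELDS}
    for line in lines:
        # Blank lines and '@' header lines are skipped in this sketch.
        if not line.strip() or line.startswith("@"):
            continue
        parts = line.rstrip("\n").split("\t")
        # Optional tag fields beyond the first eleven are ignored here.
        for name, value in zip(SAM_FIELDS, parts[:11]):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns


def coordinate_sort_order(columns):
    """Row indices ordered by (RNAME, POS); only two columns are read.

    Assumption: reference names are compared as plain strings, which is a
    simplification of the header-defined order real tools use.
    """
    n = len(columns["POS"])
    return sorted(range(n),
                  key=lambda i: (columns["RNAME"][i], columns["POS"][i]))


# Made-up example reads, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t300\t60\t4M\t*\t0\t0\tACGT\tFFFF",
    "r2\t16\tchr1\t100\t60\t4M\t*\t0\t0\tTTGA\tFFFF",
    "r3\t0\tchr1\t200\t30\t4M\t*\t0\t0\tGGCC\tFFFF",
]
cols = sam_to_columns(example)
order = coordinate_sort_order(cols)
print([cols["QNAME"][i] for i in order])  # → ['r2', 'r3', 'r1']
```

In an Arrow-based system the per-field arrays would live in Arrow buffers that several processes and languages can map without copying, which is the (de)serialization saving the abstract refers to; the plain Python lists above only mimic the layout.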

Files

ArrowSAM.pdf (PDF, 0.76 MB)

License info not available