High-Performance Optimization of DNA Long Read De Novo Assembler

Abstract

This thesis focuses on accelerating the polishing stage of the Flye genome assembler. Flye is a de novo assembler designed for long reads produced by modern sequencing technologies, and it handles large genomes with high accuracy and efficiency. A crucial component of the assembly process is the polishing stage, which refines the draft assembly to correct errors and improve overall accuracy. However, this stage is computationally intensive and time-consuming, making it a significant bottleneck in genome assembly workflows.

To address this, a novel multi-threading architecture is introduced, significantly reducing mutex contention by minimizing the use and acquisition times of mutexes within the bubble processor. Additionally, advanced vectorization techniques using AVX (Advanced Vector Extensions) instructions are incorporated to process multiple reads simultaneously. These optimizations effectively parallelize the polishing process and exploit modern CPU capabilities for enhanced performance.
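One common way to reduce mutex contention of the kind described above is to let each worker thread buffer its results locally and acquire the shared mutex once per batch rather than once per item. The following is a minimal C++ sketch of that general idea, not the thesis implementation: `polishBubble`, `processBubbles`, the `batchSize` parameter, and the use of plain integers in place of bubbles are all hypothetical names and simplifications introduced for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-bubble polishing work.
static int polishBubble(int bubbleId) { return bubbleId * 2; }

// Each worker collects results in a thread-local buffer and takes the
// shared mutex once per batch instead of once per bubble.
// The order of the returned results is not deterministic.
std::vector<int> processBubbles(const std::vector<int>& bubbles,
                                unsigned numThreads, std::size_t batchSize)
{
    std::vector<int> results;
    results.reserve(bubbles.size());
    std::mutex resultsMutex;

    auto worker = [&](std::size_t begin, std::size_t end) {
        std::vector<int> local;
        local.reserve(batchSize);
        // Copy the local batch into the shared vector under the lock.
        auto flush = [&] {
            std::lock_guard<std::mutex> lock(resultsMutex);
            results.insert(results.end(), local.begin(), local.end());
            local.clear();
        };
        for (std::size_t i = begin; i < end; ++i) {
            local.push_back(polishBubble(bubbles[i]));  // no lock held here
            if (local.size() >= batchSize) flush();
        }
        if (!local.empty()) flush();
    };

    numThreads = std::max(1u, numThreads);
    const std::size_t n = bubbles.size();
    const std::size_t chunk = (n + numThreads - 1) / numThreads;

    // Give each thread one contiguous slice of the input.
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(n, t * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        threads.emplace_back(worker, begin, end);
    }
    for (auto& th : threads) th.join();
    return results;
}
```

With a batch size of 64, for example, a worker enters the critical section roughly once per 64 bubbles, so the time spent waiting on the lock shrinks relative to the time spent on the polishing work itself. The trade-off is extra per-thread memory and results that arrive out of order.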

Benchmarking the enhanced polishing stage on both bacterial and human genome datasets demonstrates a substantial improvement in processing time. For the bacterial dataset, the error correction process achieves speedups of 3.0x and 4.3x on one core using AVX2 and AVX-512 instructions, respectively. On eight cores, the corresponding speedups are 2.6x and 2.7x. For the human genome dataset, the process achieves a speedup of 4.0x on one core when handling 1 million bubbles, while 32 cores yield a speedup of 2.3x for the same dataset. Applying AVX2 to the complete dataset on 64 cores results in a speedup of 1.4x. This acceleration not only reduces computational costs but also expedites the overall genome assembly process, making it more feasible for large-scale and time-sensitive genomic studies. The implementation is available on GitHub.

Files

Kaiyi_Zhao_MSc_thesis.pdf
(pdf | 0.539 Mb)
License info not available