Evaluation of DNA scaffolding techniques using PacBio long reads

The goal of this study was to explore the novel techniques associated with the Third Generation DNA Sequencing (TGS) platform PacBio. This three-year-old technology looked promising from the beginning, but its distinctive attributes, such as read length and error content, require unconventional handling to yield worthwhile results. To date, only a handful of tools operate on TGS datasets, compared to the large number of Second Generation Sequencing (SGS) assemblers available. To use this technology efficiently, we need to become familiar with its characteristics and study the mistakes the existing tools make. To this end, a simulated environment was created to comprehensively evaluate two popular hybrid scaffolders (AHA and SSPACE-LongRead) and, through that, to gain knowledge of the impact of PacBio dataset properties on scaffolding. The evaluation was not limited to contiguity performance (N50, N90, etc.) but also examined the accuracy of the results. Three different reference genomes (E. coli, Arabidopsis and human) were used for the evaluation, and multiple runs were executed for statistical purposes. Apart from the simulated experiments, the capabilities of both tools were also tested on a real dataset, Cyprinus carpio (carp). Each tool excelled in distinct situations: SSPACE-LongRead demonstrated better contiguity (~60% longer N50) and shorter execution time (3–14x faster), whereas AHA was the more accurate (up to 4x fewer incorrect joins). PacBio dataset coverage and error rate had a negligible effect on the scaffolding of small genomes but a more pronounced effect on complex ones. Surprisingly, random features of PacBio sequencing, such as length distribution and error position, had a dominant effect on performance.
Finally, in the carp dataset experiments, AHA achieved a 6x increase in N50 and maximum length, while SSPACE-LongRead failed to finish its execution, having scaffolded only one fourth of the genome. It nevertheless managed to produce scaffolds with a maximum length of 646 kbp, compared to 501 kbp for AHA.
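
For readers unfamiliar with the contiguity metrics named above, the following is a minimal sketch of how N50 and N90 are commonly computed from a list of scaffold lengths. It is not taken from the thesis or from either tool; the function name nxx and the example lengths are hypothetical and chosen only for illustration.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx value of a set of sequence lengths.

    Nxx is the length L such that sequences of length >= L together
    cover at least `fraction` of the total bases (0.5 gives N50,
    0.9 gives N90). `nxx` is a hypothetical helper name.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(lengths, reverse=True)  # longest sequences first
    threshold = fraction * sum(ordered)      # bases that must be covered

    covered = 0
    for length in ordered:
        covered += length
        if covered >= threshold:
            return length


# Illustrative scaffold lengths (made up, not results from the study).
scaffolds = [80, 70, 50, 40, 30, 20]
n50 = nxx(scaffolds, 0.5)
n90 = nxx(scaffolds, 0.9)
print(n50, n90)
```

A higher N50 indicates that more of the assembly sits in long scaffolds, which is why the metric says nothing about whether the joins are correct and why accuracy has to be assessed separately.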

Subject

bioinformatics
DNA
scaffolding
PacBio
hybrid assembly
SMRT

To reference this document use:

http://resolver.tudelft.nl/uuid:ef7d8499-ff37-4ebc-97d6-ee9804938f5c

Part of collection

Student theses

Document type

master thesis

Files

PDF

thesis.pdf

1.91 MB