Evaluation of DNA scaffolding techniques using PacBio long reads

Author: Patsis, K.
Contributor: Al-Ars, Z. (mentor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Department of Electrical Engineering
Programme: Computer Engineering
Date: 2014-09-17

Abstract

The goal of this study was to explore the novel techniques associated with the use of the available Third Generation DNA Sequencing (TGS) platform, PacBio. This three-year-old technology sounded promising right from the beginning, but its groundbreaking attributes, such as length and error content, required unconventional handling to yield worthwhile results. To date, only a handful of tools operate on TGS datasets, compared to the enormous number of Second Generation Sequencing (SGS) assemblers available. To use this innovative technology efficiently, we need to familiarize ourselves with its characteristics and study the mistakes the existing tools made. To achieve this, a simulated environment was created to comprehensively evaluate two popular hybrid scaffolders (AHA, SSPACE-LongRead) and, through that, gain knowledge of the impact of PacBio dataset properties on scaffolding. The evaluation was not limited to contiguity performance (N50, N90, etc.) but also examined the accuracy of the results. Three different reference genomes (E. coli, Arabidopsis and human) were used for the evaluation, and multiple runs were executed for statistical purposes. Apart from the simulated experiments, the capabilities of both tools were also tested on a real dataset, Cyprinus carpio (carp). Comparison-wise, each tool thrived in distinct situations, with SSPACE-LongRead demonstrating better contiguity (~60% longer N50) and shorter execution time (3–14x faster), whereas AHA was the more accurate (up to 4x fewer incorrect joins).
PacBio dataset coverage and error rate had a negligible effect on the scaffolding of small genomes but a more pronounced effect on complex ones. Surprisingly, random features of PacBio sequencing, such as the length distribution and the position of errors, had a dominant effect on performance. Finally, in the carp dataset experiments, AHA achieved a 6x increase in N50 and maximum length, while SSPACE-LongRead failed to finish execution (it scaffolded one fourth of the genome). It nevertheless managed to produce scaffolds with a maximum length of 646 kbp, compared to 501 kbp for AHA.

Subject: bioinformatics, DNA, scaffolding, PacBio, hybrid assembly, SMRT
To reference this document use: http://resolver.tudelft.nl/uuid:ef7d8499-ff37-4ebc-97d6-ee9804938f5c
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2014 Patsis, K.
Files: thesis.pdf (PDF, 1.91 MB)