Early DNA Analysis Using Incomplete DNA Data

Abstract

In the past few years, considerable effort has gone into reducing the computational time of genome data analysis, which has eliminated critical computational bottlenecks in the analysis of DNA information. However, genome analysis is still held back by the slow speed of DNA sequencing machines: sequencing is a time-consuming process that can take days even for a single sample. This limits the speed of existing DNA analysis methods, since they all have to wait for the fully sequenced DNA data before they can start the analysis. As a result, DNA analysis pipelines cannot fully benefit from the reduced computational analysis time. Recently, a new method called early DNA analysis was introduced, in which the genome analysis pipeline is started with incomplete DNA data before sequencing finishes. This opens the door to reducing the total time of DNA analysis, sequencing time included. In this thesis, a parallel implementation of the early DNA analysis approach, based on the Apache Spark big data framework, is proposed to improve its performance. Using incomplete DNA data sets, however, also causes a slight drop in the accuracy of genome analysis. The original method proposed a few simple methods to complete the unknown DNA data, but these leave room for improvement. Therefore, a few new completion algorithms are also proposed and tested in this thesis to increase accuracy. Results show that the proposed scalable implementation of early DNA analysis can achieve a 7.6× speed-up with 97.48% correctness when deployed on a 4-node Power7+ cluster, while one of the advanced completion algorithms can increase the classification accuracy for unknown DNA data by 0.006%.