Classification of diverse bacterial populations

Abstract

Accurately diagnosing and treating patients infected with multiple strains of a pathogen is challenging. Whole genome sequencing has high potential to provide insight into the microbial composition of human metagenomic samples. However, distinguishing multiple strains of a single species is difficult because their genetic content is highly similar. Several tools aimed at identifying different strains in metagenomic sequence data are currently available. We present an independent benchmark that compares the performance of several of these tools. The tools were evaluated on a variety of synthetic metagenomic samples containing strain mixtures of Enterococcus, Escherichia coli and Mycobacterium tuberculosis.
To facilitate this research, we built a benchmark framework in Python 3. The framework makes it possible to test the performance of tools that aim to unravel the composition of sequence data. It can automatically generate batches of metagenomic readsets with custom, predefined properties, on which the tools then run their analyses in a streamlined fashion. The output of each tool is converted to a standardized format, which makes a complete comparison of the tools easier.
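As an illustration only, the following Python 3 sketch shows one way such a workflow could be organized: enumerating a batch of readset specifications, bringing tool output into a common abundance format, and scoring it against the known composition. All function names, the dictionary layout, and the example properties (coverage, read length) are hypothetical assumptions made for this sketch and are not taken from the framework itself.

```python
from itertools import product


def readset_specs(strain_mixtures, coverages, read_lengths):
    """Enumerate one readset specification per combination of properties.

    Hypothetical layout: each spec is a plain dict; the chosen properties
    (coverage, read length) are assumed examples of "predefined properties".
    """
    return [
        {"strains": mix, "coverage": cov, "read_length": length}
        for mix, cov, length in product(strain_mixtures, coverages, read_lengths)
    ]


def standardize(raw_abundances):
    """Convert tool-specific abundance values into fractions that sum to 1."""
    total = sum(raw_abundances.values())
    if total <= 0:
        raise ValueError("tool reported no abundance")
    return {strain: value / total for strain, value in raw_abundances.items()}


def abundance_error(truth, predicted):
    """Sum of absolute abundance differences over all strains in either profile."""
    strains = set(truth) | set(predicted)
    return sum(abs(truth.get(s, 0.0) - predicted.get(s, 0.0)) for s in strains)
```

With a shared format like this, every tool can be scored by the same function regardless of how it reports its results, for example `abundance_error(truth, standardize(tool_output))`.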
This framework was built as part of our Bachelor End Project over the course of 10 weeks. In the first few weeks we became familiar with the domain of bioinformatics and the types of tools that had to be included in this research. Implementing the framework required a thorough understanding of the tools and was time-consuming. Towards the end of the project, the framework was used to run the tools on a large variety of synthetic readsets. Analysis of these outputs resulted in an overview of the tools' capabilities, as presented in this paper.