Representation counts: the impact of embedding models on disease detection tasks from microbiome sequencing data

Bachelor Thesis (2022)
Author(s)

M. Strocchi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Corso – Mentor (Massachusetts Institute of Technology)

P. Liò – Mentor (University of Cambridge)

Jasmijn A. Baaijens – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Cynthia C. S. Liem – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Mattia Strocchi
More Info
Publication Year
2022
Language
English
Graduation Date
23-06-2022
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The human microbiome, the ensemble of microorganisms found in and on the human body, plays a key role in human health and disease. However, microbiome data in its current form poses a significant challenge for machine learning algorithms. Datasets of microbiome sequences are typically high-dimensional and come with relatively few labels, which makes it difficult for a model to distinguish informative features from random noise and to avoid overfitting. It is therefore essential to reduce the dimensionality of the input data while preserving its structure and information, so that a model can learn from it properly. K-mer frequency vectors and learnable representations obtained through encoders are among the embedding methods that have been proposed in the literature to reduce the dimensionality of the input space for machine learning algorithms operating on biological sequences. This work compares how various embedding techniques influence the performance of a downstream disease detection task on microbiome sequencing data. In particular, the research shows that k-mer frequency vectors lead to better classification performance (AUC = 0.88) than NeuroSEED embeddings in Euclidean space (AUC = 0.76). The work also shows that the formulation of the classification problem is critical to improving overall disease detection performance.
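
As a rough illustration of the k-mer frequency vectors mentioned above, the following sketch maps a nucleotide sequence to a fixed-length vector of normalized k-mer counts. It is not taken from the thesis: the function name `kmer_frequency_vector`, the default `k=3`, the `ACGT` alphabet, and the choice to skip k-mers containing other characters are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalized k-mer frequencies of `sequence`.

    Hypothetical helper for illustration only. The output has
    len(alphabet) ** k entries, ordered lexicographically by k-mer.
    """
    sequence = sequence.upper()
    # Enumerate every possible k-mer so the vector length is fixed,
    # regardless of which k-mers actually occur in the sequence.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count all overlapping windows of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Assumption: windows with characters outside the alphabet
    # (for example the ambiguity code N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


vector = kmer_frequency_vector("ACGTACGT", k=2)
print(len(vector))  # 4 ** 2 = 16 dimensions, independent of sequence length
```

The vector length depends only on `k` and the alphabet size, which is what lets sequences of different lengths share one input space for a downstream classifier.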

Files

M_Strocchi_2022.pdf
(pdf | 1.23 Mb)
License info not available