Representation counts: the impact of embedding models on disease detection tasks from microbiome sequencing data

Bachelor Thesis (2022)

Author(s)

M. Strocchi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G. Corso – Mentor (Massachusetts Institute of Technology)

P. Liò – Mentor (University of Cambridge)

Jasmijn A. Baaijens – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Cynthia C. S. Liem – Graduation committee member (TU Delft - Multimedia Computing)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Machine Learning, Microbiome, Disease detection

To reference this document use:

https://resolver.tudelft.nl/uuid:c0aaa51a-18b3-4e2a-b507-9cb4bd7c08a9

Publication Year

2022

Language

English

Graduation Date

23-06-2022

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The human microbiome, the ensemble of microorganisms found in and on the human body, plays a key role in human health and disease. However, microbiome data currently pose a significant challenge for machine learning algorithms. Datasets of microbiome sequences are typically high-dimensional and come with relatively few labels, which makes it difficult for a model to distinguish informative features from random noise and to avoid overfitting. It is therefore paramount to reduce the dimensionality of the input data while preserving their structure and information, so that a model can learn from them properly. K-mer frequency vectors and learnable representations produced by encoders are among the embedding methods proposed in the literature to reduce the dimensionality of the input space for machine learning algorithms operating on biological sequences. This work compares how various embedding techniques influence the performance of a downstream disease detection task on microbiome sequencing data. In particular, the research shows that k-mer frequency vectors lead to better classification performance (AUC = 0.88) than NeuroSEED embeddings in Euclidean space (AUC = 0.76). The work also shows that the formulation of the classification problem is critical to improving overall disease detection performance.
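
As an illustration of the k-mer frequency vectors mentioned in the abstract, the following is a minimal sketch of how a single DNA read could be turned into a fixed-length vector. It is not taken from the thesis: the function name `kmer_frequency_vector`, the default `k=3`, the `ACGT` alphabet, and the choice to skip windows containing other characters (such as `N`) are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return normalised k-mer frequencies for one sequence.

    Hypothetical helper for illustration only; the thesis may use a
    different k, normalisation, or handling of ambiguous bases.
    """
    sequence = sequence.upper()
    # All possible k-mers over the alphabet, in a fixed order, so that
    # every sequence maps to a vector of the same length (len(alphabet)**k).
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only windows made entirely of alphabet characters contribute.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    vec = kmer_frequency_vector("ACGTAC", k=2)
    print(len(vec), vec[1])
```

With `k=2` the vector has 4^2 = 16 entries regardless of read length. For the read `ACGTAC`, the 2-mer `AC` occurs in 2 of the 5 overlapping windows, so its entry is 0.4. The vector length grows as 4^k, which is why the choice of k trades off resolution against dimensionality.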

Files

M_Strocchi_2022.pdf (pdf, 1.23 MB)

License info not available