Reference-free biomarker mining in metagenomic data using language embedding

Agrawal, Isha

Title

Reference-free biomarker mining in metagenomic data using language embedding

Author

Agrawal, Isha (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

Abeel, T.E.P.M.F. (mentor)
Hai, R. (graduation committee)
Peng, C. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science

Date

2023-04-21

Abstract

Metagenomic Next-Generation Sequencing (mNGS) offers a promising way to generate massive volumes of sequence reads in a short period of time. This has opened opportunities for disease diagnosis based on individual variations and mutations by considering the microbiome profile of each patient. However, the effective use of this data requires the design of appropriate algorithms that can represent the metagenomic data in an accurate and condensed manner.

In this work, we acknowledged the efficiency of current approaches such as reference-based methods and frequency encoding. However, we also recognized their limitations, such as restricting findings to pre-existing knowledge and inadequately representing reads and metagenomic samples. Accordingly, we explored a natural language embedding technique called Doc2Vec as a potential embedding approach for metagenomic studies and phenotype prediction.

We introduced some modifications to the original Doc2Vec architecture to remove a bottleneck in analysing long reads. This was done by replacing k-mer-level encoding with a nucleotide-level representation. We used the embeddings obtained from this method as input to a logistic classifier and ridge regression models. We compared the results with Kraken2 on colorectal cancer and type-2 diabetes classification, and on regression tasks for type-2 diabetes-related measures.
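
To make the difference between k-mer-level and nucleotide-level encoding concrete, the following is a minimal Python sketch of the two tokenizations. It is not the thesis implementation: the function names, the example read, the stride of 1, and the choice of k are all assumptions made for illustration, and the abstract does not specify how the modified Doc2Vec consumes these tokens.

```python
from typing import List


def kmer_tokens(read: str, k: int = 3) -> List[str]:
    """Split a read into overlapping k-mers with stride 1 (assumed setting)."""
    read = read.upper()
    # A read shorter than k yields an empty token list.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def nucleotide_tokens(read: str) -> List[str]:
    """Split a read into single-nucleotide tokens."""
    return list(read.upper())


def vocabulary_bound(k: int, alphabet_size: int = 4) -> int:
    """Upper bound on the number of distinct tokens of length k."""
    return alphabet_size ** k


if __name__ == "__main__":
    read = "ACGTAC"  # hypothetical read, for illustration only
    print(kmer_tokens(read, k=3))
    print(nucleotide_tokens(read))
    print(vocabulary_bound(6), vocabulary_bound(1))
```

One visible difference is the vocabulary size: over the four-letter DNA alphabet, k-mer tokens can take up to 4^k distinct values, whereas nucleotide tokens take at most 4. Whether this is the bottleneck addressed in the thesis is not stated in the abstract.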

The results suggest comparable performance between the proposed method and the reference-based method for colorectal cancer classification. For the type-2 diabetes dataset, the reference-based method performs significantly better. In regression tasks predicting various metrics associated with type-2 diabetes, the proposed representation was comparable to the reference-based method for some phenotypes but lacked flexibility in others, indicating that the applicability of the proposed approach strongly depends on the objective, dataset, and target phenotype.

Subject

Bioinformatics
Computer Science
Embedding

To reference this document use:

http://resolver.tudelft.nl/uuid:ad1c6a14-d8a0-4973-9eae-fac46b0bdf9c

Part of collection

Student theses

Document type

master thesis

Files

PDF

Thesis_Isha_Agrawal_5457777.pdf

2.09 MB
