Title: Reference-free biomarker mining in metagenomic data using language embedding
Author: Agrawal, Isha (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Abeel, T.E.P.M.F. (mentor); Hai, R. (graduation committee); Peng, C. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science
Date: 2023-04-21

Abstract

Metagenomic Next-Generation Sequencing (mNGS) presents a promising avenue to generate massive volumes of sequence reads in a short period of time. This has opened opportunities for disease diagnosis based on individual variations and mutations by considering the microbiome profile of each patient. However, the effective use of this data requires the design of appropriate algorithms that can represent the metagenomic data in an accurate and condensed manner.

In this work, we acknowledged the efficiency of current approaches such as reference-based methods and frequency encoding. However, we also recognized the limitations of current methods, such as limiting findings to pre-existing knowledge and inadequate representation of reads and metagenomic samples. Accordingly, we explored a natural language embedding technique, called Doc2Vec, as a potential embedding approach for metagenomic study and phenotype prediction.

We introduced some modifications to the original Doc2Vec architecture to remove a bottleneck in analysing long reads. This was done by replacing k-mer-level encoding with nucleotide-level representation. We used the embeddings obtained from this method as input to a logistic classifier and ridge regression models.
We compared the results with Kraken2 on colorectal cancer and type-2 diabetes classification, and on regression tasks for type-2 diabetes-related measures.

The results suggest comparable performance between the proposed method and the reference-based method for colorectal cancer classification. For the type-2 diabetes dataset, the reference-based method performs significantly better. In regression tasks to predict various metrics associated with type-2 diabetes, the proposed representation was comparable to the reference-based method for some phenotypes, but lacked flexibility in others, indicating that the applicability of the proposed approach strongly depends on the objective, dataset, and target phenotype.

Subject: Bioinformatics, Computer Science, Embedding
To reference this document use: http://resolver.tudelft.nl/uuid:ad1c6a14-d8a0-4973-9eae-fac46b0bdf9c
Part of collection: Student theses
Document type: master thesis
Rights: © 2023 Isha Agrawal
Files: Thesis_Isha_Agrawal_5457777.pdf (PDF, 2.09 MB)