Reference-free biomarker mining in metagenomic data using language embedding

Abstract

Metagenomic Next-Generation Sequencing (mNGS) can generate massive volumes of sequence reads in a short period of time. This has opened opportunities for disease diagnosis based on individual variations and mutations by considering the microbiome profile of each patient. However, using these data effectively requires algorithms that represent metagenomic data accurately and compactly.

In this work, we acknowledged the efficiency of current approaches such as reference-based methods and frequency encoding. However, we also recognized their limitations, such as restricting findings to pre-existing knowledge and representing reads and metagenomic samples inadequately. Accordingly, we explored a natural language embedding technique, Doc2Vec, as a potential embedding approach for metagenomic studies and phenotype prediction.

We modified the original Doc2Vec architecture to remove a bottleneck in analysing long reads by replacing k-mer-level encoding with nucleotide-level representation. We used the resulting embeddings as input to a logistic classifier and ridge regression models. We compared the results with Kraken2 on colorectal cancer and type-2 diabetes classification, and on regression tasks for type-2 diabetes-related measures.
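
To make the contrast between the two encodings concrete, the following is a minimal sketch of how a read might be tokenized at the k-mer level versus the nucleotide level. It is an illustration only, not the implementation used in this work: the function names, the stride of 1, and the toy read are all assumptions. It also shows the general fact that the number of possible k-mers over a four-letter alphabet grows as 4^k, whereas nucleotide-level tokens come from a vocabulary of only four symbols.

```python
from typing import List


def kmer_tokens(read: str, k: int) -> List[str]:
    """Split a read into overlapping k-mers (stride 1 is an assumption)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def nucleotide_tokens(read: str) -> List[str]:
    """Split a read into single-nucleotide tokens."""
    return list(read)


def max_vocabulary_size(k: int, alphabet_size: int = 4) -> int:
    """Upper bound on the number of distinct tokens of length k."""
    return alphabet_size ** k


read = "ACGTAC"  # hypothetical toy read, not taken from any dataset
print(kmer_tokens(read, 3))      # ['ACG', 'CGT', 'GTA', 'TAC']
print(nucleotide_tokens(read))   # ['A', 'C', 'G', 'T', 'A', 'C']
print(max_vocabulary_size(6))    # 4096
```

Either token list could then be passed to a document-embedding model as the "words" of a read; how the modified architecture consumes nucleotide-level input is not specified here.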

The results suggest comparable performance between the proposed method and the reference-based method for colorectal cancer classification. For the type-2 diabetes dataset, the reference-based method performs significantly better. In regression tasks predicting various metrics associated with type-2 diabetes, the proposed representation was comparable to the reference-based method for some phenotypes but lacked flexibility for others, indicating that the applicability of the proposed approach depends strongly on the objective, dataset, and target phenotype.