The artificially generated microbiome

A study on the generation and potential use cases of predicted meta-omics data

Abstract

Motivation: Imbalances in the human gut microbiome have been linked to various conditions, including inflammatory bowel disease (IBD), diabetes, and mental health disorders. While metagenomics and amplicon sequencing are the most commonly used technologies to characterize microbial communities, they do not capture all layers of the microbiome's functional activity. Data from other meta-omics modalities is generally difficult to obtain because of high costs and error-prone technologies, among other issues. The growing availability of paired meta-omics data offers an opportunity to develop machine learning models that infer connections between metagenomics data and other forms of meta-omics data, with the aim of predicting those other forms from metagenomics data. To that end, we evaluated several machine learning model architectures on the task of predicting meta-omics features from various meta-omics inputs, and analyzed the robustness of these models as well as potential use cases of artificially generated microbiome data.

Results: Machine learning models, in particular simpler architectures such as elastic net regression models and random forests, generated reliable predictions of transcript and metabolite abundances, with correlations of up to 0.77 and 0.74, respectively. Predicting protein profiles proved more difficult, with correlations of at most 0.42. We also identified a core set of well-predicted features for each meta-omics output type, and showed that multi-output regression neural networks performed similarly when trained using fewer output features. Lastly, our experiments demonstrated that predicted features can be used for the downstream task of IBD prediction. For instance, the accuracy obtained using predicted metabolite abundances was 77%, compared with 80% using real metabolomics data.