Synthetic data generation for the optimization of strains in metabolic engineering using latent space representations derived from a Conditional Variational Autoencoder

Abstract

This study investigates generative models for synthetic data generation in pathway optimization experiments in metabolic engineering. Conditional Variational Autoencoders (CVAEs) use neural networks and latent variable distributions to generate new, plausible data samples. We adapt this model by conditioning the training process on the target flux, with the aim of improving performance.

Additionally, Probabilistic Principal Component Analysis (PPCA) was selected as a baseline model for generating the latent space. The comparison tests the hypothesis that a type of Variational Autoencoder (VAE) can learn a reduced-dimensional latent space for configurations of a kinetic pathway model. A dataset of 5000 hypothetical configurations of a kinetic pathway model was used to extract relationships between elements of a kinetic pathway.

The results indicate that PPCA can model the underlying distribution of the dataset when the latent space is large enough. However, the traditional CVAE might struggle to capture the underlying distribution, resulting in an entangled latent space. The study suggests that an implementation of $\beta$-CVAE could lead to a better balance between parts of the objective function during training, offering improved prospects for generating cost-efficient kinetic pathways for combinatorial pathway optimization experiments.
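
For orientation, the balance in question can be written in the usual form of a $\beta$-weighted CVAE objective. This is a sketch in standard notation, not necessarily the exact formulation used in the study. The symbols are assumptions introduced here: $x$ denotes a pathway configuration, $c$ the target flux used as the condition, $z$ the latent variable, $q_\phi$ the encoder, $p_\theta$ the decoder, and $p(z \mid c)$ the prior, which is often taken to be a standard normal distribution.

$$
\mathcal{L}_\beta(\theta, \phi; x, c) = \mathbb{E}_{q_\phi(z \mid x, c)}\left[\log p_\theta(x \mid z, c)\right] - \beta \, D_{\mathrm{KL}}\left(q_\phi(z \mid x, c) \,\|\, p(z \mid c)\right)
$$

Setting $\beta = 1$ recovers the standard CVAE evidence lower bound. Other values of $\beta$ shift the weight between the reconstruction term and the Kullback-Leibler term, which is the trade-off between parts of the objective referred to above.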