As a cell, is it better to be single?

Exploring the feasibility of fine-tuning Geneformer on bulk RNA sequencing data

Abstract

New machine learning models for biomedicine are being developed at a rapid pace, accelerated by the advent of transformer-based architectures. These models support a range of applications, from diagnostics to assessing drug effectiveness, many of which are fundamentally cell classification problems. Models such as Geneformer [1] learn to distinguish between cell classes from gene expression data, which is usually obtained through single-cell RNA sequencing. The alternative source, bulk RNA sequencing, is more widely available and cheaper, which makes it worthwhile to explore whether it can be used to train Geneformer. In this paper, pseudo-bulk datasets are created from single-cell data by aggregating gene expression values. A method for generating synthetic single-cell-like data from a bulk dataset is then used to create new datasets: some remain purely synthetic, while others are mixed with real single-cell data. Geneformer is fine-tuned on each generated dataset separately, and its performance on a cell classification task is measured. The results show that the more closely a dataset resembles real single-cell data, the better the model performs. Fine-tuning Geneformer on bulk data is found to be infeasible: the synthetic data fails to fine-tune the model effectively and has no meaningful impact when added to a single-cell dataset. It is concluded that the generated synthetic data is of too low quality and that alternative generation methods should be explored.
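
To illustrate the pseudo-bulk step mentioned above, the following is a minimal sketch of one common way to aggregate single-cell expression into bulk-like profiles. It is not taken from the paper: the aggregation rule (a plain sum of raw counts per group), the function name `pseudo_bulk`, and the grouping labels are all assumptions made for illustration.

```python
import numpy as np


def pseudo_bulk(counts, groups):
    """Sum per-cell expression counts within each group.

    counts: array of shape (n_cells, n_genes) with raw counts.
    groups: length-n_cells sequence of group labels (hypothetical,
            e.g. a sample or cell-class identifier).
    Returns (labels, profiles); profiles has shape (n_groups, n_genes)
    and row i is the summed expression of all cells labelled labels[i].
    """
    counts = np.asarray(counts)
    groups = np.asarray(groups)
    if counts.ndim != 2 or groups.ndim != 1:
        raise ValueError("counts must be 2-D and groups must be 1-D")
    if counts.shape[0] != groups.shape[0]:
        raise ValueError("counts and groups must describe the same cells")

    # Map every cell to the index of its (sorted, unique) group label.
    labels, inverse = np.unique(groups, return_inverse=True)

    # Accumulate each cell's row into the row of its group.
    profiles = np.zeros((labels.size, counts.shape[1]), dtype=counts.dtype)
    np.add.at(profiles, inverse, counts)
    return labels, profiles


# Toy example: four cells, three genes, two groups.
example_counts = np.array([[1, 0, 2],
                           [3, 1, 0],
                           [0, 5, 1],
                           [2, 2, 2]])
example_groups = ["A", "B", "A", "B"]
example_labels, example_profiles = pseudo_bulk(example_counts, example_groups)
```

Summing keeps the total count of every gene unchanged, which mimics sequencing the pooled cells together; averaging or normalising per group would be equally plausible choices, and the abstract does not say which one was used.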