As a cell, is it better to be single?

Exploring the feasibility of fine-tuning Geneformer on bulk RNA sequencing data

Bachelor Thesis (2024)
Author(s)

A.L. Kuźnicki (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Marcel Reinders – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

N. Brouwer – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

N.M. Gürel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
24-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Powerful new machine learning models for biomedicine are being developed at a rapid pace, accelerated by the advent of transformer-based architectures. These models can be used for various applications, from diagnostics to assessing drug effectiveness, many of which are fundamentally cell classification problems. Models like Geneformer [1] use gene expression data to learn to distinguish between cell classes. This data is usually obtained through single-cell RNA sequencing. However, the alternative source, bulk RNA sequencing, is more widely available and cheaper, which makes it worthwhile to explore whether it can be used to train Geneformer. In this paper, pseudo-bulk datasets are created from single-cell data by aggregating gene expression values. A method for generating synthetic single-cell-like data from a bulk dataset is then used to create new datasets. Some remain purely synthetic, while others are mixed with real single-cell data. Geneformer is fine-tuned on each generated dataset separately, and its performance on a cell classification task is measured. The results show that the more closely a dataset resembles real single-cell data, the better the model’s performance. Fine-tuning Geneformer on bulk data is found to be infeasible. The synthetic data fails to fine-tune the model effectively and has no meaningful impact when added to a single-cell dataset. It is concluded that the generated synthetic data is of too low quality and that alternative generation methods should be explored.
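The pseudo-bulk step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how cells are grouped or whether expression values are summed or averaged, so the details below are assumptions made for illustration only: the function name make_pseudo_bulk, the cells_per_sample parameter, pooling within a single cell class, and gene-wise summation are all hypothetical and may differ from the thesis.

```python
import numpy as np

def make_pseudo_bulk(counts, labels, cells_per_sample, rng):
    """Aggregate single-cell expression rows into pseudo-bulk samples.

    counts: (n_cells, n_genes) array of expression counts.
    labels: (n_cells,) array of cell-class labels.
    Each pseudo-bulk row is the gene-wise sum of `cells_per_sample` cells
    drawn from one class (an assumed scheme, not the thesis's confirmed one).
    """
    counts = np.asarray(counts)
    labels = np.asarray(labels)
    samples, sample_labels = [], []
    for label in np.unique(labels):
        # Shuffle the cells of this class so that groups are random.
        idx = rng.permutation(np.flatnonzero(labels == label))
        # Drop the remainder so every sample pools the same number of cells.
        n_full = len(idx) // cells_per_sample
        groups = idx[: n_full * cells_per_sample].reshape(n_full, cells_per_sample)
        for group in groups:
            samples.append(counts[group].sum(axis=0))
            sample_labels.append(label)
    return np.array(samples), np.array(sample_labels)

# Toy data: 5 cells, 3 genes, two cell classes.
counts = np.array([[1, 0, 2],
                   [0, 3, 1],
                   [4, 1, 0],
                   [2, 2, 2],
                   [5, 0, 1]])
labels = np.array(["A", "A", "B", "B", "B"])
rng = np.random.default_rng(0)
bulk, bulk_labels = make_pseudo_bulk(counts, labels, cells_per_sample=2, rng=rng)
print(bulk.shape, bulk_labels)
```

In this toy example each class yields one pseudo-bulk sample of two cells, and the leftover third cell of class B is discarded. Summing preserves count-like totals, whereas averaging would keep values on a per-cell scale; which of the two suits a rank-based model such as Geneformer is a design choice this sketch does not settle.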
