As a cell, is it better to be single?

Exploring the feasibility of fine-tuning Geneformer on bulk RNA sequencing data

Bachelor Thesis (2024)

Author(s)

A.L. Kuźnicki (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M.J.T. Reinders – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

N. Brouwer – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

N.M. Gürel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Machine Learning, Fine-Tuning, Cancer, Synthetic Data Generation, RNA-Sequencing, Geneformer

To reference this document use:

https://resolver.tudelft.nl/uuid:4368bff4-c213-4989-8e28-249a762d5655

Publication Year

2024

Language

English

Graduation Date

24-06-2024

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Powerful new machine learning models in biomedicine are being developed constantly, a trend further hastened by the advent of transformer-based architectures. These systems can be used for various applications, from diagnostics to assessing drug effectiveness, and many of these applications are fundamentally cell classification problems. Models like Geneformer [1] use gene expression data to learn how to distinguish between cell classes. This data is usually obtained through single-cell RNA sequencing. However, the alternative source, bulk RNA sequencing, is more widely available and cheaper, which makes it enticing to explore whether it can be used to train Geneformer. In this paper, pseudo-bulk datasets are created from single-cell data by aggregating gene expression values. A method that generates synthetic single-cell-like data from a bulk dataset is then used to create new datasets. Some remain purely synthetic, while others are mixed with real single-cell data. Geneformer is fine-tuned on each generated dataset separately, and its performance on a cell classification problem is measured. It is shown that the more a dataset resembles real single-cell data, the better the model’s performance. Using bulk data to fine-tune Geneformer is proven to be infeasible. The synthetic data fails to fine-tune the model effectively and is shown to have no meaningful impact when added to a single-cell dataset. It is concluded that the generated synthetic data is of too low quality and that alternative generation methods should be explored.
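The pseudo-bulk step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which aggregation function or grouping the thesis uses, so summing raw counts per sample label is an assumption here, and the function name `pseudo_bulk`, the toy counts, and the labels are all hypothetical.

```python
def pseudo_bulk(cells, labels):
    """Aggregate single-cell count vectors into one profile per label.

    cells  : list of per-cell gene count lists (same gene order in each)
    labels : list with one group label per cell (hypothetical grouping)

    Assumption: aggregation is a plain per-gene sum; the thesis may use
    a different function (for example a mean) or a different grouping.
    """
    if len(cells) != len(labels):
        raise ValueError("cells and labels must have the same length")

    totals = {}
    for counts, label in zip(cells, labels):
        if label not in totals:
            # First cell of this group: start from an all-zero profile.
            totals[label] = [0] * len(counts)
        elif len(counts) != len(totals[label]):
            raise ValueError("all cells must cover the same genes")
        # Add this cell's counts gene by gene.
        for gene_index, count in enumerate(counts):
            totals[label][gene_index] += count
    return totals


# Toy example: three cells, three genes, two groups.
cells = [
    [3, 0, 1],
    [1, 2, 0],
    [0, 5, 2],
]
labels = ["tumor", "tumor", "healthy"]

profiles = pseudo_bulk(cells, labels)
print(profiles)
```

Each resulting profile has one value per gene, like a bulk sample, and no longer records which cell contributed which counts. That loss of per-cell information is the gap the synthetic single-cell-like data described in the abstract is meant to bridge.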

Files

Research_paper_alan_kuznicki.p... (pdf)

(pdf | 0.768 Mb)

License info not available