Comparative Analysis of Geneformer and Traditional Machine Learning Techniques in Predicting Perturbation Combination Efficacy on Cancer Cell Lines

An Empirical Evaluation Using the sciplex2 Dataset

Abstract

Cancer poses a significant clinical, social, and economic burden, necessitating the development of effective treatments. Understanding how drugs interact with cancer cells and their downstream effects is critical for creating new therapies and overcoming drug resistance. This paper compares the predictive performance of the Geneformer model with traditional machine learning methods in predicting the response of cancer cells to perturbation combinations using the sciplex2 dataset.

The study preprocesses the sciplex2 dataset, trains the models, and evaluates them on two tasks: binary classification of cells as treated or untreated, and prediction of the impact of gene perturbations. Traditional ML models achieved higher accuracy on the binary classification task, whereas Geneformer performed better at predicting the impact of gene perturbations, which the study attributes to its architecture and extensive pre-training on single-cell transcriptomes.

Key findings reveal that highly expression-correlated gene pairs cause the largest shifts in cell classification, underscoring the importance of gene correlations in biological predictions. Geneformer appeared to capture gene network dynamics more fully: it achieved higher maximum Cosine Shifts than PCA embeddings and placed less emphasis on highly differentially expressed (HDE) single genes. Instead, it focused on HDE gene pairs, suggesting that it may capture complex downstream effects of gene perturbations.
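
The abstract does not define the Cosine Shift, so the following is only a minimal sketch under a stated assumption: that the shift is one minus the cosine similarity between a cell's embedding before and after an in silico perturbation. The helper names (`cosine_shift`, `rank_perturbations`), the gene labels, and the toy three-dimensional vectors are hypothetical and are not taken from the study.

```python
import numpy as np


def cosine_shift(before, after):
    """Return 1 - cosine similarity between two embedding vectors.

    Assumed definition: 0 means the embedding direction is unchanged,
    1 means the perturbed embedding is orthogonal to the original.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    denom = np.linalg.norm(before) * np.linalg.norm(after)
    if denom == 0.0:
        raise ValueError("cosine shift is undefined for a zero vector")
    return 1.0 - float(np.dot(before, after) / denom)


def rank_perturbations(baseline, perturbed_by_name):
    """Sort perturbations from largest to smallest cosine shift."""
    shifts = {
        name: cosine_shift(baseline, embedding)
        for name, embedding in perturbed_by_name.items()
    }
    return sorted(shifts.items(), key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, for illustration only).
baseline = np.array([1.0, 0.0, 0.0])
perturbed = {
    "GENE_A": np.array([1.0, 0.0, 0.0]),         # same direction as baseline
    "GENE_B": np.array([1.0, 1.0, 0.0]),         # 45 degrees from baseline
    "GENE_A+GENE_B": np.array([0.0, 1.0, 0.0]),  # orthogonal to baseline
}

ranking = rank_perturbations(baseline, perturbed)
for name, shift in ranking:
    print(f"{name}: {shift:.3f}")
```

Under this assumed definition the same function applies to any embedding space, so a comparison between Geneformer embeddings and PCA embeddings would differ only in how the `baseline` and perturbed vectors are produced.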

This study highlights the potential of integrating advanced deep learning models like Geneformer into drug discovery, offering a pathway for more effective and targeted therapeutic interventions.