Robustness of Fitted Mutational Signature Exposures in Single-Cell Data

Deciphering Cancer Heterogeneity with Machine Learning

Bachelor Thesis (2025)
Author(s)

R. Nys (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Joana Gonçalves – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

S. Costa – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

I. Stresec – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Catharine Oertel – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
25-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Tumor heterogeneity complicates mutational signature analysis at the single-cell level, where sparse catalogues and uneven mutation burdens can destabilise exposure estimates. This study quantifies the robustness of fitted mutational signature exposures in single-cell RNA-seq data from 688 breast-cancer cells. Known COSMIC v3.4 SBS96 signatures were assigned with SigProfilerAssignment, and the input data were systematically perturbed by randomly deleting 5%, 10%, 20% and 40% of mutations, with each perturbation repeated twenty times. Robustness was assessed with four complementary metrics: (i) persistence of each signature in the dataset, (ii) stability of the number of cells containing each signature, (iii) mean relative error of per-signature exposures, and (iv) per-cell cosine similarity between original and perturbed exposure vectors. Six signatures (SBS1, 5, 12, 26, 40c and 54) were consistently recovered, even after 40% deletion, suggesting that core biological signals can survive substantial data loss. Nevertheless, higher deletion levels triggered progressive overfitting: the number of additional signatures rose from three at 5% deletion to eighteen at 40%. Exposures appeared to shift between highly similar signature pairs (e.g., SBS12 and SBS26, SBS5 and SBS40c), and merging such pairs halved the mean relative error. Signature SBS54, detected in only eight cells and suspected to be artefactual, showed the poorest stability. Across cells, robustness correlated positively with the number of mutations per cell (ρ ≈ 0.38 to 0.59) and negatively with the entropy of the exposure vectors (ρ ≈ −0.27 to −0.53), indicating that abundant or signature-dominated catalogues resist perturbation, whereas sparse or evenly distributed ones are more fragile. Together, our results indicate that while some signatures and cells can survive substantial data loss, signature exposures in sparse single-cell catalogues must be interpreted with caution.
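
The perturbation step and the per-cell quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code: the function names are hypothetical, the exact metric definitions used in the thesis are not given here, and the signature fitting itself (done with SigProfilerAssignment in the study) is left out. The sketch assumes a cell's catalogue is a vector of integer mutation counts per SBS96 channel, that deletion removes individual mutations uniformly at random without replacement, that relative error is averaged over signatures with non-zero original exposure, and that entropy is Shannon entropy in bits of the normalised exposure vector.

```python
import numpy as np


def delete_mutations(counts, fraction, rng):
    """Randomly remove a fraction of the mutations from a count catalogue.

    Assumption: every individual mutation is equally likely to be deleted,
    which amounts to drawing the kept mutations without replacement.
    """
    counts = np.asarray(counts, dtype=np.int64)
    total = int(counts.sum())
    n_keep = total - int(round(fraction * total))
    # Draw n_keep mutations without replacement across the channels.
    return rng.multivariate_hypergeometric(counts, n_keep)


def cosine_similarity(a, b):
    """Cosine similarity between two exposure vectors (0.0 if either is all zero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def mean_relative_error(original, perturbed):
    """Mean of |perturbed - original| / original over non-zero original exposures."""
    original = np.asarray(original, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    mask = original > 0
    return float(np.mean(np.abs(perturbed[mask] - original[mask]) / original[mask]))


def exposure_entropy(exposures):
    """Shannon entropy (bits) of the exposure vector normalised to sum to one."""
    p = np.asarray(exposures, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    catalogue = np.array([10, 0, 5, 25])  # toy 4-channel catalogue, not real data
    for fraction in (0.05, 0.10, 0.20, 0.40):
        perturbed = delete_mutations(catalogue, fraction, rng)
        print(fraction, perturbed, cosine_similarity(catalogue, perturbed))
```

In a full pipeline the perturbed catalogues would be refitted against the reference signatures, and the metrics above would compare the original and refitted exposure vectors rather than the raw counts. A rank correlation between per-cell similarity and mutation count or entropy would then yield ρ values of the kind reported above.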
