S. Costa | TU Delft Repository

Mutational signatures in the general population

Population-scale mutational signature analysis of blood-derived genomes from the UK Biobank

Master thesis (2026) - K.N.I. Timmerman, Joana Gonçalves, S. Costa, J. Sun, M. Weinmann

Mutational processes leave characteristic patterns of somatic mutations, traditionally studied in tumour tissue. Far less is known about whether they can be observed in normal tissue, particularly blood. Detecting the mutational imprint of disease-associated processes there could enable earlier detection and intervention. Here, we investigate whether the mutational signal detected in blood can be explained by biological and clinical factors, in particular age, DNA repair deficiencies, and cancer diagnoses. Using whole-genome sequencing of blood-derived DNA from 17,419 UK Biobank participants, we developed a filtering strategy to isolate somatic mutations and analysed four views of the mutational landscape: mutation burden, mutation channel composition, exposures to de novo signatures and exposures to COSMIC signatures. We modelled their relation to these factors using regression analysis. Across all four views, sequencing provider was the dominant predictor, far outweighing other predictors. Among non-technical predictors, BRCA (p = 0.024) and POLE (p = 0.030) variants were significantly associated with a higher mutation burden. A leukaemia diagnosis was the strongest signal across the remaining views, appearing in both the mutation channel composition and the exposure to the clock-like signature SBS1 (p = 2.10e-06). De novo signature exposures clustered by sequencing provider, and no association survived when extraction was performed separately for each provider. Our results show that some biological and clinical factors do explain part of the mutational signal in blood, but that technical variation between sequencing providers dominates the mutational landscape and must be addressed before blood can serve as a reliable substrate for mutational analysis. ...

Comparing De Novo and COSMIC Mutational Signatures in Single-Cell Sequencing Data

Bachelor thesis (2025) - F.T.M. de Haas, Sara Costa, Ivan Stresec, Joana Gonçalves, Catharine Oertel Genannt Bierbach

Understanding mutational processes active in cancer at the single-cell level is essential for characterizing intra-tumor heterogeneity. Previous studies extracted these processes, called mutational signatures, and the known signatures can be found in the Catalogue of Somatic Mutations in Cancer (COSMIC) database. These signatures were derived based on bulk sequencing data of thousands of whole genomes. This study proposes and applies a systematic method to compare single-cell-derived de novo mutational signatures to the COSMIC signatures. Using two single-cell cancer datasets (breast and neck cancer), two stable signatures were extracted per dataset. Within each dataset, the de novo signatures were extremely similar (cosine similarity > 0.96), suggesting uniform mutational processes within individual tumors. No direct one-to-one matches were found between de novo and COSMIC signatures. However, the de novo signatures can be interpreted as combinations of known mutational processes. These results demonstrate the feasibility of extracting de novo signatures based on single-cell data, while also highlighting limitations due to possible overfitting. Future work should include simulation experiments, analysis of additional tumors, and evaluation of alternative signature extraction methods. ...
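The comparison of de novo signatures against a reference catalogue can be illustrated with cosine similarity, the measure quoted in the abstract. The following is a minimal sketch, not the thesis's actual pipeline: the function names `cosine_similarity` and `best_match`, the reference names `REF_A` and `REF_B`, and the 4-channel toy vectors are all hypothetical (real SBS signatures have 96 channels).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def best_match(de_novo, reference):
    """Return (name, similarity) of the reference signature closest to de_novo."""
    return max(
        ((name, cosine_similarity(de_novo, sig)) for name, sig in reference.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy signatures over 4 mutation channels.
reference = {
    "REF_A": [0.7, 0.1, 0.1, 0.1],
    "REF_B": [0.1, 0.1, 0.1, 0.7],
}
de_novo = [0.6, 0.2, 0.1, 0.1]

name, sim = best_match(de_novo, reference)
print(name, round(sim, 3))
```

A single best match does not capture the case described in the abstract, where a de novo signature is better read as a combination of several known signatures; that would require fitting the reference signatures jointly rather than one at a time.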

Learning Signature Exposures from Gene Expression at Single-Cell Resolution

Regular vs. Multitask Learning of Individual Regression Models

Bachelor thesis (2025) - A. Potolski Eilat, Joana Gonçalves, S. Costa, I. Stresec, C.R.M.M. Oertel Genannt Bierbach

Understanding the mutational processes active within cancer cells is essential to improve diagnosis and treatment strategies. This study investigates whether the activity levels of these processes, quantified as mutational signature exposures, can be predicted from single-cell gene expression data. Two regression-based learning paradigms are compared: regular independent modelling, where the model for each mutational signature selects its own regularisation parameter and set of genes, and multitask modelling, where the models agree on a shared set of genes to be used for the prediction of every signature and also share the regularisation parameter. We evaluate their predictive performance and interpretability using biologically informed metrics. Furthermore, we assess the models’ robustness on unseen data by simulating real-world shifts through clustering-based data splits. Our results show that while both paradigms achieve reasonable predictive accuracy, independently trained models offer greater flexibility and interpretability by identifying signature-specific genes and regularisation strengths. These findings suggest that gene expression carries meaningful information about a cell’s mutational history and that signature-specific modelling may offer better biological insight into tumour heterogeneity. ...

Deciphering Cancer Heterogeneity with Machine Learning

Signature fitting analysis on single cells in relation to pseudo-bulk data

Bachelor thesis (2025) - M.R. Rotar, Joana Gonçalves, S. Costa, I. Stresec, C.R.M.M. Oertel Genannt Bierbach

The field of oncology has greatly benefited from the study of mutational signatures, patterns of mutations that appear within the cancer genome. Previous research has focused on using various mathematical models to uncover and understand these mutational signatures by looking at the genetic information of aggregated cells, typically sequenced from a tumor biopsy, which is referred to as bulk data. However, recent developments in sequencing techniques have made it possible to investigate genetic information at the single-cell level rather than in bulk. Thus, in this paper, we used machine learning-based tools to examine the effect of performing signature fitting at the single-cell level in relation to pseudo-bulk. We found that single cells show a higher degree of expression than the pseudo-bulk, making it possible to identify a higher number of active mutational signatures. We also saw some single cells achieving better accuracy in the reconstruction of their mutational profile than the pseudo-bulk. We found that the heterogeneity across the single cells could be explained by a small number of clusters, which can potentially elucidate the active signatures found at the level of the pseudo-bulk sample. Finally, we found that some pseudo-bulk samples generated from subpopulations of cells unexpectedly deviate from the single cells from which they were created. From these findings, we believe that the study of active mutational signatures at the level of single cells has the potential to broaden our understanding of cancer by providing a more in-depth view of this disease. However, further research should be undertaken to either corroborate or refute these findings, mainly because of the relatively small number of mutations characterizing our data and the absence of a ground truth for the bulk data. ...
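To make the single-cell versus pseudo-bulk contrast concrete, here is a minimal sketch under one stated assumption: that a pseudo-bulk profile is the channel-wise sum of the mutation counts of its constituent cells. The abstract does not say how the pseudo-bulk was constructed, so this is illustrative only; the helper names `pseudo_bulk` and `normalise` and the 3-channel toy counts are hypothetical.

```python
def pseudo_bulk(cell_catalogues):
    """Sum per-cell mutation counts channel by channel (assumed aggregation)."""
    cells = list(cell_catalogues)
    if not cells:
        raise ValueError("need at least one cell")
    n_channels = len(cells[0])
    if any(len(cell) != n_channels for cell in cells):
        raise ValueError("all cells must have the same number of channels")
    return [sum(cell[i] for cell in cells) for i in range(n_channels)]


def normalise(counts):
    """Convert counts to proportions; an all-zero profile stays all-zero."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Hypothetical toy data: three cells, three mutation channels.
cells = [
    [4, 0, 0],  # cell dominated by channel 0
    [0, 3, 1],  # cell dominated by channel 1
    [0, 0, 2],  # cell with only channel 2
]
bulk = pseudo_bulk(cells)
bulk_profile = normalise(bulk)
print(bulk, bulk_profile)
```

In this toy case each cell has a distinct dominant channel, while the aggregated profile is nearly flat, which is one way a pseudo-bulk sample can differ from the individual cells that make it up.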

Multivariate Correlation of Mutational Signature Exposures and Gene Expression in Single-Cell Breast Cancer

Bachelor thesis (2025) - T. Dobrin, S. Costa, I. Stresec, Joana Gonçalves, C.R.M.M. Oertel Genannt Bierbach

Understanding the relationship between mutational processes and gene expression patterns is essential for gaining insights into tumor heterogeneity. In this study, we analyze single-cell RNA sequencing data from a breast cancer tumor to investigate associations between mutational signature exposures and gene expression profiles. We propose a scoring method that integrates principal component loadings, canonical correlation analysis (CCA) loadings, and signature contributions to quantify gene-signature associations. Enrichment analysis of the top-ranking genes reveals consistent involvement of extracellular matrix (ECM) receptor interaction, focal adhesion, and immune-related pathways across multiple mutational signatures. These findings suggest that different mutational processes converge on pathways involved in cell adhesion, invasion, and immune modulation. Our approach demonstrates the utility of multivariate statistical methods combined with enrichment analysis to explore the transcriptional consequences of mutational processes in cancer at the single-cell level. ...

Robustness of Fitted Mutational Signature Exposures in Single-Cell Data

Deciphering Cancer Heterogeneity with Machine Learning

Bachelor thesis (2025) - R. Nys, Joana Gonçalves, S. Costa, I. Stresec, C.R.M.M. Oertel Genannt Bierbach

Tumor heterogeneity complicates mutational signature analysis at the single-cell level, where sparse catalogues and uneven mutation burdens can destabilise exposure estimates. This study quantifies the robustness of fitted mutational signatures in single-cell RNA-seq data from 688 breast-cancer cells. Known COSMIC v3.4 SBS96 signatures were assigned with SigProfilerAssignment and the input data was systematically perturbed by randomly deleting 5%, 10%, 20% and 40% of mutations, repeating each perturbation twenty times. Robustness was assessed with four complementary metrics: (i) persistence of each signature in the dataset, (ii) stability of the number of cells containing each signature, (iii) mean relative error of per-signature exposures, and (iv) per-cell cosine similarity between original and perturbed exposure vectors. Six signatures (SBS1, 5, 12, 26, 40c and 54) were consistently recovered, even after 40% deletion, demonstrating that core biological signals may survive substantial data loss. Nevertheless, higher deletion levels triggered progressive overfitting: the number of additional signatures rose from three at 5% deletion to eighteen at 40%. Exposures seemed to shift between highly similar signature pairs (e.g., SBS12 and SBS26, SBS5 and SBS40c), and merging such pairs halved the mean relative error. Signature SBS54, detected in only eight cells and suspected to be artefactual, showed the poorest stability. Across cells, robustness scaled positively with the number of mutations per cell (ρ ≈ 0.38 to 0.59) and negatively with entropy of the exposure vectors (ρ ≈ −0.27 to −0.53), indicating that abundant or signature-dominated catalogues resist perturbation, whereas sparse or evenly distributed ones are more fragile. Together, our results indicate that while some signatures and cells can survive substantial data loss, signature exposures in sparse single-cell catalogues must be interpreted with caution. ...
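The perturbation scheme and two of the quantities named in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis's code: the function names are hypothetical, the entropy is taken as Shannon entropy (base 2) of the normalised exposure vector, and the relative error is averaged only over signatures with a non-zero original exposure. The abstract does not give the exact definitions, and the signature assignment step (SigProfilerAssignment) is not reproduced here.

```python
import math
import random


def delete_fraction(mutations, fraction, rng):
    """Return a random subset with round(fraction * n) mutations removed."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    n_keep = len(mutations) - round(fraction * len(mutations))
    return rng.sample(mutations, n_keep)


def shannon_entropy(exposures):
    """Shannon entropy (bits) of an exposure vector after normalisation."""
    total = sum(exposures)
    if total == 0:
        return 0.0
    probs = [e / total for e in exposures if e > 0]
    return -sum(p * math.log2(p) for p in probs)


def mean_relative_error(original, perturbed):
    """Mean of |perturbed - original| / original over non-zero originals."""
    errors = [abs(p - o) / o for o, p in zip(original, perturbed) if o > 0]
    if not errors:
        return 0.0
    return sum(errors) / len(errors)


# Hypothetical toy data: 100 mutation identifiers for one cell.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
mutations = list(range(100))
for fraction in (0.05, 0.10, 0.20, 0.40):
    kept = delete_fraction(mutations, fraction, rng)
    print(fraction, len(kept))

print(shannon_entropy([1, 1, 1, 1]))          # evenly spread exposures
print(shannon_entropy([5, 0, 0]))             # single dominant signature
print(mean_relative_error([10, 20], [9, 22]))
```

In a full robustness experiment each perturbed catalogue would be refitted to the reference signatures and the loop repeated (twenty times per deletion level in the abstract) before the metrics are compared across cells.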
