C.F. Seale | TU Delft Repository

MUSICiAn: Genome-wide Identification of Genes Involved in DNA Repair via Control-Free Mutational Spectra Analysis

Journal article (2026) - C.F. Seale, Marco Barazas, Robin van Schendel, Marcel Tijsterman, Joana P. Gonçalves

Understanding DNA double-strand break (DSB) repair is crucial for the development of targeted anticancer therapies, yet the roles of many genes remain unclear. Recent studies show that disruption of known DSB repair genes can alter the sequence-specific distribution of mutations arising after DSB repair, suggesting that genome-wide perturbation screens could be leveraged to identify new DSB genes leading to distinct deviations from the expected wild-type distribution. Given the challenges of designing controls for a genome-wide screen, we explore the high gene throughput to forgo the use of traditional controls by reframing the analysis as an outlier detection problem, assuming that most genes have minimal influence on DSB repair outcomes. We propose MUSICiAn (Mutational Signature Catalogue Analysis), a compositional data analysis method that ranks gene perturbation impact on mutational spectra without controls by measuring deviations from the central tendency considering the distribution of all spectra. We show that MUSICiAn effectively estimates pseudo-controls for the Repair-seq screen, yielding 476 genes and 60 nontargeting controls. We further apply MUSICiAn to the first genome-wide screen of 18 406 genes with mutational spectra readout, MUSIC, reporting that MUSICiAn successfully recovers known DSB repair genes, highlights the spliceosome as a lesser-appreciated player, and reveals candidates for further investigation. ...
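To illustrate the control-free idea in the abstract above, here is a minimal sketch of outlier ranking on compositional data. It is not the MUSICiAn implementation: the centred log-ratio (CLR) transform, the component-wise median as the "central tendency", the Euclidean distance score, the pseudocount, and all gene names and spectra are assumptions chosen for illustration only.

```python
import math
from statistics import median


def clr(composition, pseudocount=1e-6):
    """Centred log-ratio transform of one spectrum (a vector of proportions).

    The pseudocount is an assumption here, added only to avoid log(0).
    """
    logs = [math.log(p + pseudocount) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def rank_by_deviation(spectra):
    """Rank genes by distance from the component-wise median in CLR space.

    The median over all genes acts as a pseudo-control, relying on the
    assumption that most perturbations barely change the spectrum.
    """
    transformed = {gene: clr(s) for gene, s in spectra.items()}
    n_parts = len(next(iter(transformed.values())))
    centre = [median(t[i] for t in transformed.values()) for i in range(n_parts)]
    scores = {gene: math.dist(t, centre) for gene, t in transformed.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, made-up spectra over three hypothetical outcome categories.
spectra = {
    "geneA": [0.50, 0.30, 0.20],
    "geneB": [0.51, 0.29, 0.20],
    "geneC": [0.49, 0.31, 0.20],
    "geneD": [0.10, 0.20, 0.70],  # deliberately shifted spectrum
}
ranking = rank_by_deviation(spectra)
```

In this toy setting the deliberately shifted spectrum ranks first, since it lies farthest from the median of all spectra. A robust centre is used because no dedicated control samples are assumed to exist.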

Harnessing The CRISPR Data Revolution to Uncover The Secrets of Double-Strand DNA Repair

Doctoral thesis (2025) - C.F. Seale, M.J.T. Reinders, Joana Gonçalves

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) technology has transformed molecular biology by enabling a strategy for precise, efficient, and relatively simple genome editing. Guided by a small strand of RNA, CRISPR locates specific DNA sequences within the genome and introduces double-strand breaks (DSBs). A typical cell can detect and fix the damage by invoking one of several DNA repair pathways. However, repair is not error-free and often introduces mutations. The mutagenic nature of repair pathways can be leveraged to disrupt genes or regulatory elements with high specificity, providing a powerful tool for gaining insights into gene function. Researchers can also generate datasets of mutations left behind after DSB induction and repair within different genomic contexts to learn more about the mutagenic effects of DNA repair. In this thesis, we explore challenges and novel approaches for analysing large-scale datasets of mutations and gene essentiality generated via CRISPR technology. ...

X-CRISP: Domain-Adaptable and Interpretable CRISPR Repair Outcome Prediction

Journal article (2025) - C.F. Seale, Joana P. Gonçalves

Motivation
Controlling the outcomes of CRISPR editing is crucial for the success of gene therapy. Since donor template-based editing is often inefficient, alternative strategies have emerged that leverage mutagenic end-joining repair instead. Existing machine learning models can accurately predict end-joining repair outcomes; however, generalisability beyond the specific cell line used for training remains a challenge, and interpretability is typically limited by suboptimal feature representation and model architecture.
Results
We propose X-CRISP, a flexible and interpretable neural network for predicting repair outcome frequencies based on a minimal set of outcome and sequence features, including microhomologies (MH). Outperforming prior models on detailed and aggregate outcome predictions, X-CRISP prioritised MH location over MH sequence properties such as GC content for deletion outcomes. Through transfer learning, we adapted X-CRISP pre-trained on wild-type mESC data to target human cell lines K562, HAP1, U2OS, and mESC lines with altered DNA repair function. Adapted X-CRISP models improved over direct training on target data from as few as 50 samples, suggesting that this strategy could be leveraged to build models for new domains using a fraction of the data required to train models from scratch. ...

SNMF: Integrated Learning of Mutational Signatures and Prediction of DNA Repair Deficiencies

Preprint (2024) - A.C.H. Goossens, Y.I. Tepeli, C.F. Seale, Joana P. Gonçalves

ELISL: early-late integrated synthetic lethality prediction in cancer

Journal article (2023) - Y.I. Tepeli, C.F. Seale, Joana P. Gonçalves

Motivation

Anti-cancer therapies based on synthetic lethality (SL) exploit tumour vulnerabilities for treatment with reduced side effects, by targeting a gene that is jointly essential with another whose function is lost. Computational prediction is key to expedite SL screening, yet existing methods are vulnerable to prevalent selection bias in SL data and reliant on cancer or tissue type-specific omics, which can be scarce. Notably, sequence similarity remains underexplored as a proxy for related gene function and joint essentiality.
Results

We propose ELISL, Early–Late Integrated SL prediction with forest ensembles, using context-free protein sequence embeddings and context-specific omics from cell lines and tissue. Across eight cancer types, ELISL showed superior robustness to selection bias and recovery of known SL genes, as well as promising cross-cancer predictions. Co-occurring mutations in a BRCA gene and ELISL-predicted pairs from the HH, FGF, WNT, or NEIL gene families were associated with longer patient survival times, revealing therapeutic potential. ...

Overcoming Selection Bias in Synthetic Lethality Prediction

Journal article (2022) - Colm Seale, Yasin Tepeli, Joana P. Gonçalves

Motivation
Synthetic lethality (SL) between two genes occurs when simultaneous loss of function leads to cell death. This holds great promise for developing anti-cancer therapeutics that target synthetic lethal pairs of endogenously disrupted genes. Identifying novel SL relationships through exhaustive experimental screens is challenging, due to the vast number of candidate pairs. Computational SL prediction is therefore sought to identify promising SL gene pairs for further experimentation. However, current SL prediction methods lack consideration for generalizability in the presence of selection bias in SL data.
Results
We show that SL data exhibit considerable gene selection bias. Our experiments designed to assess the robustness of SL prediction reveal that models driven by the topology of known SL interactions (e.g. graph, matrix factorization) are especially sensitive to selection bias. We introduce selection bias-resilient synthetic lethality (SBSL) prediction using regularized logistic regression or random forests. Each gene pair is described by 27 molecular features derived from cancer cell line, cancer patient tissue and healthy donor tissue samples. SBSL models are built and tested using approximately 8000 experimentally derived SL pairs across breast, colon, lung and ovarian cancers. Compared to other SL prediction methods, SBSL showed higher predictive performance, better generalizability and robustness to selection bias. Gene dependency, quantifying the essentiality of a gene for cell survival, contributed most to SBSL predictions. Random forests were superior to linear models in the absence of dependency features, highlighting the relevance of mutual exclusivity of somatic mutations, co-expression in healthy tissue and differential expression in tumour samples.
Availability and implementation
https://github.com/joanagoncalveslab/sbsl
Supplementary information
Supplementary data are available at Bioinformatics online. ...
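As a rough illustration of the regularized logistic regression mentioned in the abstract above, the sketch below fits an L2-penalised model to toy gene-pair features. It is not the SBSL code (see the repository linked above for that): the two features, the labels, the penalty strength, the learning rate, and the function names are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_l2(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent for logistic regression with an L2 penalty.

    The penalty is applied to the weights only, not to the intercept.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that a gene pair is synthetic lethal."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy gene pairs described by two hypothetical molecular features;
# label 1 marks a (made-up) synthetic lethal pair.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic_l2(X, y)
```

Calling `predict_proba(w, b, [0.85, 0.85])` then gives the model's score for an unseen pair. A real evaluation of selection bias would additionally need train and test splits that control which genes appear in each, which this sketch does not attempt.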
