Joana Gonçalves | TU Delft Repository

MUSICiAn: Genome-wide Identification of Genes Involved in DNA Repair via Control-Free Mutational Spectra Analysis

Journal article (2026) - C.F. Seale, Marco Barazas, Robin van Schendel, Marcel Tijsterman, Joana P. Gonçalves

Understanding DNA double-strand break (DSB) repair is crucial for the development of targeted anticancer therapies, yet the roles of many genes remain unclear. Recent studies show that disruption of known DSB repair genes can alter the sequence-specific distribution of mutations arising after DSB repair, suggesting that genome-wide perturbation screens could be leveraged to identify new DSB genes leading to distinct deviations from the expected wild-type distribution. Given the challenges of designing controls for a genome-wide screen, we explore the high gene throughput to forgo the use of traditional controls by reframing the analysis as an outlier detection problem, assuming that most genes have minimal influence on DSB repair outcomes. We propose MUSICiAn (Mutational Signature Catalogue Analysis), a compositional data analysis method that ranks gene perturbation impact on mutational spectra without controls by measuring deviations from the central tendency considering the distribution of all spectra. We show that MUSICiAn effectively estimates pseudo-controls for the Repair-seq screen, yielding 476 genes and 60 nontargeting controls. We further apply MUSICiAn to the first genome-wide screen of 18 406 genes with mutational spectra readout, MUSIC, reporting that MUSICiAn successfully recovers known DSB repair genes, highlights the spliceosome as a lesser-appreciated player, and reveals candidates for further investigation. ...
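The control-free ranking idea in this abstract can be illustrated with a minimal sketch. It is not the MUSICiAn implementation. It assumes a centred log-ratio (CLR) transform as the compositional step, a component-wise median as the "central tendency", and Euclidean distance in CLR space as the deviation score. The function names, the pseudocount and the toy gene labels are all hypothetical.

```python
import math
from statistics import median


def clr(spectrum, pseudocount=1e-6):
    """Centred log-ratio transform of one compositional vector.

    The pseudocount (an assumption of this sketch) avoids log(0)
    for outcome categories that were never observed.
    """
    logs = [math.log(x + pseudocount) for x in spectrum]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def rank_by_deviation(spectra):
    """Rank perturbations by distance from the bulk of all spectra.

    spectra: dict mapping a gene label to a list of outcome frequencies
             (all lists must have the same length).
    Returns a list of (gene, distance) pairs, most deviant first.
    """
    transformed = {gene: clr(freqs) for gene, freqs in spectra.items()}
    n_parts = len(next(iter(transformed.values())))
    # Component-wise median acts as a pseudo-control, relying on the
    # assumption that most perturbations barely change the spectrum.
    centre = [
        median(vec[i] for vec in transformed.values()) for i in range(n_parts)
    ]
    scores = {gene: math.dist(vec, centre) for gene, vec in transformed.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy = {
        "geneA": [0.50, 0.30, 0.20],
        "geneB": [0.52, 0.29, 0.19],
        "geneC": [0.49, 0.31, 0.20],
        "geneD": [0.10, 0.20, 0.70],  # deliberately shifted spectrum
    }
    for gene, score in rank_by_deviation(toy):
        print(gene, round(score, 3))
```

A distance that accounts for the covariance of all spectra would be a natural refinement, but the plain Euclidean version keeps the sketch dependency-free.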

X-CRISP: Domain-Adaptable and Interpretable CRISPR Repair Outcome Prediction

Journal article (2025) - C.F. Seale, Joana P. Gonçalves

Motivation Controlling the outcomes of CRISPR editing is crucial for the success of gene therapy. Since donor template-based editing is often inefficient, alternative strategies have emerged that leverage mutagenic end-joining repair instead. Existing machine learning models can accurately predict end-joining repair outcomes; however, generalisability beyond the specific cell line used for training remains a challenge, and interpretability is typically limited by suboptimal feature representation and model architecture. Results We propose X-CRISP, a flexible and interpretable neural network for predicting repair outcome frequencies based on a minimal set of outcome and sequence features, including microhomologies (MH). Outperforming prior models on detailed and aggregate outcome predictions, X-CRISP prioritised MH location over MH sequence properties such as GC content for deletion outcomes. Through transfer learning, we adapted X-CRISP pre-trained on wild-type mESC data to target human cell lines K562, HAP1, U2OS, and mESC lines with altered DNA repair function. Adapted X-CRISP models improved over direct training on target data from as few as 50 samples, suggesting that this strategy could be leveraged to build models for new domains using a fraction of the data required to train models from scratch. ...

LAVA: Explainability for Unsupervised Latent Embeddings

Preprint (2025) - I. Stresec, Joana P. Gonçalves

Single-cell spatial transcriptomics reveals shared transcriptional responses to amyloid proximity in Alzheimer's disease and type 2 diabetes

Journal article (2025) - Roy Lardenoije, Angela R.S. Kruse, Lukasz G. Migas, Claire F. Scott, Cody Marshall, Morad C. Malek, Adel Eskaros, Raf Van de Plas, Joana P. Gonçalves, More authors...

Background
The presence of amyloid pathology can have a profound effect on the surrounding cellular neighborhood. While this impact has been mainly investigated for amyloid plaques in the context of Alzheimer's disease (AD), other forms of amyloid deposits can also be found in the brain and in other organs. In the pancreas, amyloid deposits consist of islet amyloid polypeptide (IAPP) and are a hallmark of type 2 diabetes (T2D). Notably, T2D has been associated with an increased risk of developing AD, and as such T2D is a common comorbidity of AD. It has therefore been suggested that these diseases may share pathophysiological processes. To advance our understanding in this respect, we compared the cellular and transcriptomic responses related to the proximity of amyloid pathology across the AD brain and T2D pancreas.

Method
Xenium single-cell spatial transcriptomic profiling was applied to tissue sections from a human post-mortem AD brain (150,060 cells) and a T2D pancreas (256,907 cells). Spatial transcriptomics images were integrated with amyloid histopathology images to determine the proximity of individual cells to amyloid deposits. Together with cell type predictions, this enabled the investigation and cross-organ comparison of amyloid-associated changes in cell type composition and gene expression changes.

Result
With respect to cell type composition, in the brain a higher proportion of microglia could be observed close to amyloid pathology, while in the pancreas this was mirrored by a higher proportion of macrophages as well as a higher proportion of activated stellate cells. Cell type specific differential gene expression analysis based on amyloid proximity revealed many cell types with altered gene expression, including astrocytes, microglia, oligodendrocytes and endothelial cells in the brain and acinar, alpha and activated stellate cells in the pancreas. Comparison across organs revealed 16 shared genes differentially expressed with proximity to amyloid deposits, including CAV1, CXCR4, MS4A6A, SNCG, and SOX2.

Conclusion
Here we spatially investigate the impact of amyloid deposits on the cellular and transcriptomic microenvironment in the brain and pancreas. Our analysis revealed a common set of amyloid proximity related genes, providing insight into potentially shared pathological pathways underlying AD and T2D. ...

SNMF: Integrated Learning of Mutational Signatures and Prediction of DNA Repair Deficiencies

Preprint (2024) - A.C.H. Goossens, Y.I. Tepeli, C.F. Seale, Joana P. Gonçalves

Correction to

Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP) (Nature Cell Biology, (2023), 25, 8, (1089-1100), 10.1038/s41556-023-01194-w)

Journal article (2024) - Sanjay Jain, Liming Pei, Joana P. Gonçalves, Huiping Liu, Paul Robson, Raf Van de Plas, Roy Lardenoije, Lukasz G. Migas, Roger Moens, More authors...

Correction to: Nature Cell Biology https://doi.org/10.1038/s41556-023-01194-w, published online 19 July 2023. In the version of this article originally published, the name of Tianyang Xu was misspelled as Tiangyang Xu. The name has been corrected in the HTML and PDF versions of the article. ...

Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning

Preprint (2024) - Y.I. Tepeli, M.J. de Wolf, Joana P. Gonçalves

DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning

Preprint (2024) - Y.I. Tepeli, Joana P. Gonçalves

The chromatin landscape of healthy and injured cell types in the human kidney

Journal article (2024) - Debora L. Gisch, Michelle Brennan, Blue B. Lake, Jeannine Basta, Mark S. Keller, Joana P. Gonçalves, L.G. Migas, Raf Van de Plas, R. Lardenoije, More authors...

There is a need to define regions of gene activation or repression that control human kidney cells in states of health, injury, and repair to understand the molecular pathogenesis of kidney disease and design therapeutic strategies. Comprehensive integration of gene expression with epigenetic features that define regulatory elements remains a significant challenge. We measure dual single nucleus RNA expression and chromatin accessibility, DNA methylation, and H3K27ac, H3K4me1, H3K4me3, and H3K27me3 histone modifications to decipher the chromatin landscape and gene regulation of the kidney in reference and adaptive injury states. We establish a spatially-anchored epigenomic atlas to define the kidney’s active, silent, and regulatory accessible chromatin regions across the genome. Using this atlas, we note distinct control of adaptive injury in different epithelial cell types. A proximal tubule cell transcription factor network of ELF3, KLF6, and KLF10 regulates the transition between health and injury, while in thick ascending limb cells this transition is regulated by NR2F1. Further, combined perturbation of ELF3, KLF6, and KLF10 distinguishes two adaptive proximal tubular cell subtypes, one of which manifested a repair trajectory after knockout. This atlas will serve as a foundation to facilitate targeted cell-specific therapeutics by reprogramming gene regulatory networks. ...

ELISL: early-late integrated synthetic lethality prediction in cancer

Journal article (2023) - Y.I. Tepeli, C.F. Seale, Joana P. Gonçalves

Motivation
Anti-cancer therapies based on synthetic lethality (SL) exploit tumour vulnerabilities for treatment with reduced side effects, by targeting a gene that is jointly essential with another whose function is lost. Computational prediction is key to expedite SL screening, yet existing methods are vulnerable to prevalent selection bias in SL data and reliant on cancer or tissue type-specific omics, which can be scarce. Notably, sequence similarity remains underexplored as a proxy for related gene function and joint essentiality.
Results
We propose ELISL, Early–Late Integrated SL prediction with forest ensembles, using context-free protein sequence embeddings and context-specific omics from cell lines and tissue. Across eight cancer types, ELISL showed superior robustness to selection bias and recovery of known SL genes, as well as promising cross-cancer predictions. Co-occurring mutations in a BRCA gene and ELISL-predicted pairs from the HH, FGF, WNT, or NEIL gene families were associated with longer patient survival times, revealing therapeutic potential. ...

Advances and prospects for the Human BioMolecular Atlas Program (HuBMAP)

Journal article (2023) - Sanjay Jain, Liming Pei, Jeffrey M. Spraggins, Michael Angelo, Joana P. Gonçalves, Raf Van de Plas, R. Lardenoije, L.G. Migas, R.A.R. Moens, More authors...

The Human BioMolecular Atlas Program (HuBMAP) aims to create a multi-scale spatial atlas of the healthy human body at single-cell resolution by applying advanced technologies and disseminating resources to the community. As the HuBMAP moves past its first phase, creating ontologies, protocols and pipelines, this Perspective introduces the production phase: the generation of reference spatial maps of functional tissue units across many organs from diverse populations and the creation of mapping tools and infrastructure to advance biomedical research. ...

Overcoming Selection Bias in Synthetic Lethality Prediction

Journal article (2022) - Colm Seale, Yasin Tepeli, Joana P. Gonçalves

Motivation
Synthetic lethality (SL) between two genes occurs when simultaneous loss of function leads to cell death. This holds great promise for developing anti-cancer therapeutics that target synthetic lethal pairs of endogenously disrupted genes. Identifying novel SL relationships through exhaustive experimental screens is challenging, due to the vast number of candidate pairs. Computational SL prediction is therefore sought to identify promising SL gene pairs for further experimentation. However, current SL prediction methods lack consideration for generalizability in the presence of selection bias in SL data.
Results
We show that SL data exhibit considerable gene selection bias. Our experiments designed to assess the robustness of SL prediction reveal that models driven by the topology of known SL interactions (e.g. graph, matrix factorization) are especially sensitive to selection bias. We introduce selection bias-resilient synthetic lethality (SBSL) prediction using regularized logistic regression or random forests. Each gene pair is described by 27 molecular features derived from cancer cell line, cancer patient tissue and healthy donor tissue samples. SBSL models are built and tested using approximately 8000 experimentally derived SL pairs across breast, colon, lung and ovarian cancers. Compared to other SL prediction methods, SBSL showed higher predictive performance, better generalizability and robustness to selection bias. Gene dependency, quantifying the essentiality of a gene for cell survival, contributed most to SBSL predictions. Random forests were superior to linear models in the absence of dependency features, highlighting the relevance of mutual exclusivity of somatic mutations, co-expression in healthy tissue and differential expression in tumour samples.
Availability and implementation
https://github.com/joanagoncalveslab/sbsl
Supplementary information
Supplementary data are available at Bioinformatics online. ...

Multiplexed Cas9 targeting reveals genomic location effects and gRNA-based staggered breaks influencing mutation efficiency

Journal article (2019) - Santiago Gisler, Joana P. Gonçalves, Waseem Akhtar, Johann de Jong, Alexey V. Pindyurin, Lodewyk F.A. Wessels, Maarten van Lohuizen

Understanding the impact of guide RNA (gRNA) and genomic locus on CRISPR-Cas9 activity is crucial to design effective gene editing assays. However, it is challenging to profile Cas9 activity in the endogenous cellular environment. Here we leverage our TRIP technology to integrate ~ 1k barcoded reporter genes in the genomes of mouse embryonic stem cells. We target the integrated reporters (IRs) using RNA-guided Cas9 and characterize induced mutations by sequencing. We report that gRNA-sequence and IR locus explain most variation in mutation efficiency. Predominant insertions of a gRNA-specific nucleotide are consistent with template-dependent repair of staggered DNA ends with 1-bp 5′ overhangs. We confirm that such staggered ends are induced by Cas9 in mouse pre-B cells. To explain observed insertions, we propose a model generating primarily blunt and occasionally staggered DNA ends. Mutation patterns indicate that gRNA-sequence controls the fraction of staggered ends, which could be used to optimize Cas9-based insertion efficiency. ...

Predicting disease associations via biological network analysis

Journal article (2014) - Kai Sun, Joana P. Gonçalves, Chris Larminie, Nataša Pržulj

Background
Understanding the relationship between diseases based on the underlying biological mechanisms is one of the greatest challenges in modern biology and medicine. Exploring disease-disease associations by using system-level biological data is expected to improve our current knowledge of disease relationships, which may lead to further improvements in disease diagnosis, prognosis and treatment.
Results
We took advantage of diverse biological data including disease-gene associations and a large-scale molecular network to gain novel insights into disease relationships. We analysed and compared four publicly available disease-gene association datasets, then applied three disease similarity measures, namely annotation-based measure, function-based measure and topology-based measure, to estimate the similarity scores between diseases. We systematically evaluated disease associations obtained by these measures against a statistical measure of comorbidity which was derived from a large number of medical patient records. Our results show that the correlation between our similarity measures and comorbidity scores is substantially higher than expected at random, confirming that our similarity measures are able to recover comorbidity associations. We also demonstrated that our predicted disease associations correlated with disease associations generated from genome-wide association studies significantly higher than expected at random. Furthermore, we evaluated our predicted disease associations via mining the literature on PubMed, and presented case studies to demonstrate how these novel disease associations can be used to enhance our current knowledge of disease relationships.
Conclusions
We present three similarity measures for predicting disease associations. The strong correlation between our predictions and known disease associations demonstrates the ability of our measures to provide novel insights into disease relationships.
...

The YEASTRACT database

An upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae

Journal article (2014) - Miguel Cacho Teixeira, Pedro Tiago Monteiro, Sara Cordeiro Madeira, Arlindo Limede Oliveira, Ana Teresa Freitas, Isabel Sa-Correia, Joana Fernandes Guerreiro, Joana Pinho Goncalves, Nuno Pereira Mira, Sandra Costa dos Santos, Tania Rodrigues Cabrito, Margarida Palma, Catarina Costa, Alexandre Paulo Francisco

The YEASTRACT (http://www.yeastract.com) information system is a tool for the analysis and prediction of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2013, this database contains over 200 000 regulatory associations between transcription factors (TFs) and target genes, including 326 DNA binding sites for 113 TFs. All regulatory associations stored in YEASTRACT were revisited and new information was added on the experimental conditions in which those associations take place and on whether the TF is acting on its target genes as activator or repressor. Based on this information, new queries were developed allowing the selection of specific environmental conditions, experimental evidence or positive/negative regulatory effect. This release further offers tools to rank the TFs controlling a gene or genome-wide response by their relative importance, based on (i) the percentage of target genes in the data set; (ii) the enrichment of the TF regulon in the data set when compared with the genome; or (iii) the score computed using the TFRank system, which selects and prioritizes the relevant TFs by walking through the yeast regulatory network. We expect that with the new data and services made available, the system will continue to be instrumental for yeast biologists and systems biology researchers. ...
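The first two ranking criteria named in this abstract, (i) percentage of target genes in the data set and (ii) enrichment of the TF regulon relative to the genome, can be sketched as follows. This is not YEASTRACT code. The hypergeometric tail test is an assumed choice of enrichment statistic, and the function names, TF labels and gene labels are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for a hypergeometric variable.

    population: number of genes in the genome
    successes:  size of the TF regulon
    draws:      size of the user's gene set
    """
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


def rank_tfs(regulons, gene_set, genome_size):
    """Score each TF by target percentage and regulon enrichment.

    regulons: dict mapping a TF label to the set of its documented targets
    gene_set: set of genes of interest (the "data set" in the abstract)
    Returns (tf, percent_of_gene_set, enrichment_p_value) tuples,
    most enriched first.
    """
    rows = []
    for tf, targets in regulons.items():
        hits = len(targets & gene_set)
        percent = 100.0 * hits / len(gene_set)
        p_value = hypergeom_sf(hits, genome_size, len(targets), len(gene_set))
        rows.append((tf, percent, p_value))
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    regulons = {
        "TF1": {"g1", "g2", "g3"},
        "TF2": {"g4", "g5", "g6", "g7", "g8", "g9"},
    }
    for row in rank_tfs(regulons, {"g1", "g2", "g3", "g4"}, genome_size=100):
        print(row)
```

The two criteria can disagree: a TF with a very large regulon may cover a high percentage of the gene set without being enriched, which is why both are reported per TF.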

LateBiclustering

Efficient Heuristic Algorithm for Time-Lagged Bicluster Identification

Journal article (2014) - Joana P. Gonçalves, Sara Cordeiro Madeira

Identifying patterns in temporal data is key to uncover meaningful relationships in diverse domains, from stock trading to social interactions. Also of great interest are clinical and biological applications, namely monitoring patient response to treatment or characterizing activity at the molecular level. In biology, researchers seek to gain insight into gene functions and dynamics of biological processes, as well as potential perturbations of these leading to disease, through the study of patterns emerging from gene expression time series. Clustering can group genes exhibiting similar expression profiles, but focuses on global patterns denoting rather broad, unspecific responses. Biclustering reveals local patterns, which more naturally capture the intricate collaboration between biological players, particularly under a temporal setting. Despite the general biclustering formulation being NP-hard, considering specific properties of time series has led to efficient solutions for the discovery of temporally aligned patterns. Notably, the identification of biclusters with time-lagged patterns, suggestive of transcriptional cascades, remains a challenge due to the combinatorial explosion of delayed occurrences. Herein, we propose LateBiclustering, a sensible heuristic algorithm enabling a polynomial rather than exponential time solution for the problem. We show that it identifies meaningful time-lagged biclusters relevant to the response of Saccharomyces cerevisiae to heat stress. ...

Heuristic approaches for time-lagged biclustering

Conference paper (2013) - Joana P. Gonçalves, Sara C. Madeira

Identifying patterns in temporal data supports complex analyses in several domains, including stock markets (finance) and social interactions (social science). Clinical and biological applications, such as monitoring patient response to treatment or characterizing activity at the molecular level, are also of interest. In particular, researchers seek to gain insight into the dynamics of biological processes, and potential perturbations of these leading to disease, through the discovery of patterns in time series gene expression data. For many years, clustering has remained the standard technique to group genes exhibiting similar response profiles. However, clustering defines similarity across all time points, focusing on global patterns which tend to characterize rather broad and unspecific responses. It is widely believed that local patterns offer additional insight into the underlying intricate events leading to the overall observed behavior. Efficient biclustering algorithms have been devised for the discovery of temporally aligned local patterns in gene expression time series, but the extraction of time-lagged patterns remains a challenge due to the combinatorial explosion of pattern occurrence combinations when delays are considered. We present heuristic approaches enabling polynomial rather than exponential time solutions for the problem. ...

Regulatory Snapshots

Integrative Mining of Regulatory Modules from Expression Time Series and Regulatory Networks

Journal article (2012) - Joana P. Gonçalves, Ricardo S. Aires, Alexandre P. Francisco, Sara C. Madeira

Explaining regulatory mechanisms is crucial to understand complex cellular responses leading to system perturbations. Some strategies reverse engineer regulatory interactions from experimental data, while others identify functional regulatory units (modules) under the assumption that biological systems yield a modular organization. Most modular studies focus on network structure and static properties, ignoring that gene regulation is largely driven by stimulus-response behavior. Expression time series are key to gain insight into dynamics, but have been insufficiently explored by current methods, which often (1) apply generic algorithms unsuited for expression analysis over time, due to inability to maintain the chronology of events or incorporate time dependency; (2) ignore local patterns, abundant in most interesting cases of transcriptional activity; (3) neglect physical binding or lack automatic association of regulators, focusing mainly on expression patterns; or (4) limit the discovery to a predefined number of modules. We propose Regulatory Snapshots, an integrative mining approach to identify regulatory modules over time by combining transcriptional control with response, while overcoming the above challenges. Temporal biclustering is first used to reveal transcriptional modules composed of genes showing coherent expression profiles over time. Personalized ranking is then applied to prioritize prominent regulators targeting the modules at each time point using a network of documented regulatory associations and the expression data. Custom graphics are finally depicted to expose the regulatory activity in a module at consecutive time points (snapshots). Regulatory Snapshots successfully unraveled modules underlying yeast response to heat shock and human epithelial-to-mesenchymal transition, based on regulations documented in the YEASTRACT and JASPAR databases, respectively, and available expression data. 
Regulatory players involved in functionally enriched processes related to these biological events were identified. Ranking scores further suggested an ability to discern the primary role of a gene (target or regulator). A prototype is available at: http://kdbio.inesc-id.pt/software/regulatorysnapshots. ...

Explaining regulatory mechanisms is crucial to understanding complex cellular responses leading to system perturbations. Some strategies reverse engineer regulatory interactions from experimental data, while others identify functional regulatory units (modules) under the assumption that biological systems exhibit a modular organization. Most modular studies focus on network structure and static properties, ignoring that gene regulation is largely driven by stimulus-response behavior. Expression time series are key to gaining insight into dynamics, but have been insufficiently explored by current methods, which often (1) apply generic algorithms unsuited for expression analysis over time, owing to their inability to maintain the chronology of events or incorporate time dependency; (2) ignore local patterns, abundant in most interesting cases of transcriptional activity; (3) neglect physical binding or lack automatic association of regulators, focusing mainly on expression patterns; or (4) limit the discovery to a predefined number of modules. We propose Regulatory Snapshots, an integrative mining approach to identify regulatory modules over time by combining transcriptional control with response, while overcoming the above challenges. Temporal biclustering is first used to reveal transcriptional modules composed of genes showing coherent expression profiles over time. Personalized ranking is then applied to prioritize prominent regulators targeting the modules at each time point using a network of documented regulatory associations and the expression data. Custom graphics are finally used to depict the regulatory activity in a module at consecutive time points (snapshots). Regulatory Snapshots successfully unraveled modules underlying yeast response to heat shock and human epithelial-to-mesenchymal transition, based on regulations documented in the YEASTRACT and JASPAR databases, respectively, and available expression data. Regulatory players involved in functionally enriched processes related to these biological events were identified. Ranking scores further suggested an ability to discern the primary role of a gene (target or regulator). A prototype is available at: http://kdbio.inesc-id.pt/software/regulatorysnapshots.
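
The idea of a temporal bicluster (a set of genes sharing a coherent pattern over a contiguous window of time points) can be illustrated with a minimal sketch. This is not the algorithm used by Regulatory Snapshots: the discretization into up/down/no-change symbols, the `threshold` value, the function names, and the toy expression values are all assumptions made for illustration, and the enumeration is deliberately naive (it reports every qualifying window rather than only maximal ones).

```python
from collections import defaultdict


def discretize(series, threshold=0.5):
    """Encode each consecutive change as 'U' (up), 'D' (down) or 'N' (no change)."""
    symbols = []
    for previous, current in zip(series, series[1:]):
        delta = current - previous
        if delta > threshold:
            symbols.append("U")
        elif delta < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)


def temporal_modules(expression, min_genes=2, min_length=2, threshold=0.5):
    """Group genes that share the same symbol pattern over a contiguous window.

    Returns tuples (start, end, pattern, genes), where start/end index the
    transitions between time points (end is exclusive).
    """
    patterns = {gene: discretize(values, threshold)
                for gene, values in expression.items()}
    n_steps = min(len(pattern) for pattern in patterns.values())
    modules = []
    for start in range(n_steps):
        for end in range(start + min_length, n_steps + 1):
            groups = defaultdict(list)
            for gene, pattern in patterns.items():
                groups[pattern[start:end]].append(gene)
            for pattern, genes in groups.items():
                if len(genes) >= min_genes:
                    modules.append((start, end, pattern, sorted(genes)))
    return modules


# Hypothetical expression values for three genes over five time points.
expression = {
    "geneA": [0.0, 1.0, 2.0, 2.1, 1.0],
    "geneB": [0.5, 1.6, 2.8, 2.7, 1.5],
    "geneC": [2.0, 1.0, 1.1, 2.2, 3.4],
}

for start, end, pattern, genes in temporal_modules(expression):
    print(f"transitions {start}..{end - 1}: pattern {pattern}, genes {genes}")
```

Because the number of modules is not fixed in advance and windows may cover only part of the time course, the sketch reflects two of the requirements listed in the abstract (local patterns and no predefined module count), while keeping the chronology of time points intact.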

Interactogeneous: Disease Gene Prioritization Using Heterogeneous Networks and Full Topology Scores

Journal article (2012) - Joana P. Gonçalves, Alexandre P. Francisco, Yves Moreau, Sara C. Madeira

Disease gene prioritization aims to suggest potential implications of genes in disease susceptibility. Often accomplished in a guilt-by-association scheme, promising candidates are sorted according to their relatedness to known disease genes. Network-based methods have been successfully exploiting this concept by capturing the interaction of genes or proteins into a score. Nonetheless, most current approaches suffer from at least some of the following limitations: (1) networks comprise only curated physical interactions, leading to poor genome coverage and density, and bias toward a particular source; (2) scores focus on adjacencies (direct links) or the most direct paths (shortest paths) within a constrained neighborhood around the disease genes, ignoring potentially informative indirect paths; (3) global clustering is widely applied to partition the network in an unsupervised manner, attributing little importance to prior knowledge; (4) confidence weights and their contribution to edge differentiation and ranking reliability are often disregarded. We hypothesize that network-based prioritization related to local clustering on graphs and considering the full topology of weighted gene association networks integrating heterogeneous sources should overcome the above challenges. We term such a strategy Interactogeneous. We conducted cross-validation tests to assess the impact of network sources, alternative path inclusion and confidence weights on the prioritization of putative genes for 29 diseases. Heat diffusion ranking proved the best prioritization method overall, increasing the gap to neighborhood and shortest-path scores mostly on single-source networks. Heterogeneous associations consistently delivered superior performance over single-source data across the majority of methods. Results on the contribution of confidence weights were inconclusive. Finally, the best Interactogeneous strategy, heat diffusion ranking and associations from the STRING database, was used to prioritize genes for Parkinson’s disease. This method effectively recovered known genes and uncovered interesting candidates which could be linked to pathogenic mechanisms of the disease.
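
To make the notion of heat diffusion ranking on a weighted network concrete, here is a minimal sketch. It is not the Interactogeneous implementation: it assumes the standard heat equation on the graph Laplacian, approximated with explicit Euler steps, and the function names, the `beta` and `steps` parameters, the node labels, and the edge weights are all hypothetical. Heat starts on the known disease genes (seeds) and flows along edges in proportion to their confidence weights, so candidates are scored through all paths rather than only direct links or shortest paths.

```python
def heat_diffusion_scores(edges, seeds, beta=0.5, steps=100):
    """Approximate exp(-beta * L) applied to the seed vector, with L = D - W.

    edges: iterable of (node_u, node_v, weight) for an undirected graph.
    seeds: set of nodes that start with one unit of heat, shared equally.
    """
    neighbors = {}
    for u, v, weight in edges:
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight

    heat = {node: (1.0 / len(seeds) if node in seeds else 0.0)
            for node in neighbors}
    dt = beta / steps  # small step keeps the explicit scheme stable
    for _ in range(steps):
        updated = {}
        for node, linked in neighbors.items():
            # Net inflow: heat moves from hotter to cooler nodes along weighted edges.
            flow = sum(weight * (heat[other] - heat[node])
                       for other, weight in linked.items())
            updated[node] = heat[node] + dt * flow
        heat = updated
    return heat


def rank_candidates(scores, seeds):
    """Sort non-seed nodes by decreasing heat."""
    candidates = [node for node in scores if node not in seeds]
    return sorted(candidates, key=lambda node: scores[node], reverse=True)


# Hypothetical weighted association network around one known disease gene.
edges = [
    ("known_gene", "cand_A", 0.9),
    ("cand_A", "cand_B", 0.8),
    ("known_gene", "cand_C", 0.2),
    ("cand_C", "cand_D", 0.9),
]
seeds = {"known_gene"}

scores = heat_diffusion_scores(edges, seeds)
for gene in rank_candidates(scores, seeds):
    print(f"{gene}: {scores[gene]:.4f}")
```

In this toy network, `cand_B` has no direct link to the seed yet still receives heat through `cand_A`, which is the kind of indirect evidence that adjacency-based scores would miss.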

AliBiMotif: Integrating alignment and biclustering to unravel Transcription Factor Binding Sites in DNA sequences

Journal article (2012) - Joana P. Gonçalves, Yves Moreau, Sara C. Madeira

Transcription Factors (TFs) control transcription by binding to specific sites in the promoter regions of the target genes, which can be modelled by structured motifs. In this paper we propose AliBiMotif, a method combining sequence alignment and a biclustering approach based on efficient string matching techniques using suffix trees to unravel approximately conserved sets of blocks (structured motifs) while straightforwardly disregarding non-conserved stretches in-between. The ability to ignore the width of non-conserved regions is a major advantage of the proposed method over other motif finders, as the lengths of the binding sites are usually easier to estimate than the separating distances. ...
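
As a small illustration of what a structured motif is (an ordered set of conserved blocks separated by stretches of unconstrained width), the sketch below checks whether given blocks occur in order within a sequence. It is not the AliBiMotif method, which discovers approximately conserved blocks using alignment and suffix-tree-based biclustering. Here the blocks are assumed to be known in advance and matched exactly, and the function name, block strings, and promoter sequences are hypothetical.

```python
def locate_blocks(sequence, blocks):
    """Return the start position of each block if all occur in order, else None.

    Gaps between consecutive blocks may have any width, including zero.
    """
    positions = []
    cursor = 0
    for block in blocks:
        index = sequence.find(block, cursor)
        if index == -1:
            return None
        positions.append(index)
        cursor = index + len(block)  # next block must start after this one ends
    return positions


# Hypothetical promoter sequences and a two-block structured motif.
promoters = {
    "p1": "AATGACGTTTTTCACGTGAA",
    "p2": "TGACGCCACGTGTT",
    "p3": "CACGTGAAATGACG",
}
blocks = ["TGACG", "CACGTG"]

for name, sequence in promoters.items():
    print(name, locate_blocks(sequence, blocks))
```

The first two sequences contain both blocks in order with gaps of different widths, while the third contains them in the wrong order and is rejected, showing why a motif model that ignores gap width but respects block order can be convenient.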