G.A. Bouland | TU Delft Repository

Computational Approaches to Deciphering the Molecular and Cellular Heterogeneity of Alzheimer's Disease

Doctoral thesis (2025) - G.A. Bouland, M.J.T. Reinders, A.M.E.T.A. Mahfouz

In summary, the contributions within this thesis advance Alzheimer's research by introducing new computational tools and methods to better understand the genetics of the disease and cellular mechanisms. Additionally, showing that single-cell gene expression can be effectively analyzed in a binary format (expressed or not) simplifies genomic data analysis, making it more accessible, efficient, and applicable to a range of diseases and conditions. ...

An omics-based machine learning approach to predict diabetes progression

A RHAPSODY study

Journal article (2024) - Roderick C. Slieker, Magnus Münch, Louise A. Donnelly, Gerard A. Bouland, Iulian Dragan, Dmitry Kuznetsov, Petra J.M. Elders, Guy A. Rutter, Mark Ibberson, More Authors...

Aims/hypothesis: People with type 2 diabetes are heterogeneous in their disease trajectory, with some progressing more quickly to insulin initiation than others. Although classical biomarkers such as age, HbA1c and diabetes duration are associated with glycaemic progression, it is unclear how well such variables predict insulin initiation or requirement and whether newly identified markers have added predictive value. Methods: In two prospective cohort studies as part of IMI-RHAPSODY, we investigated whether clinical variables and three types of molecular markers (metabolites, lipids, proteins) can predict time to insulin requirement using different machine learning approaches (lasso, ridge, GRridge, random forest). Clinical variables included age, sex, HbA1c, HDL-cholesterol and C-peptide. Models were run with unpenalised clinical variables (i.e. always included in the model without weights) or penalised clinical variables, or without clinical variables. Model development was performed in one cohort and the model was applied in a second cohort. Model performance was evaluated using Harrell’s C statistic. Results: Of the 585 individuals from the Hoorn Diabetes Care System (DCS) cohort, 69 required insulin during follow-up (1.0–11.4 years); of the 571 individuals in the Genetics of Diabetes Audit and Research in Tayside Scotland (GoDARTS) cohort, 175 required insulin during follow-up (0.3–11.8 years). Overall, the clinical variables and proteins were selected in the different models most often, followed by the metabolites. The most frequently selected clinical variables were HbA1c (18 of the 36 models, 50%), age (15 models, 41.2%) and C-peptide (15 models, 41.2%). Base models (age, sex, BMI, HbA1c) including only clinical variables performed moderately in both the DCS discovery cohort (C statistic 0.71 [95% CI 0.64, 0.79]) and the GoDARTS replication cohort (C 0.71 [95% CI 0.69, 0.75]). A more extensive model including HDL-cholesterol and C-peptide performed better in both cohorts (DCS, C 0.74 [95% CI 0.67, 0.81]; GoDARTS, C 0.73 [95% CI 0.69, 0.77]). Two proteins, lactadherin and proto-oncogene tyrosine-protein kinase receptor, were most consistently selected and slightly improved model performance. Conclusions/interpretation: Using machine learning approaches, we show that insulin requirement risk can be modestly well predicted by predominantly clinical variables. Inclusion of molecular markers improves the prognostic performance beyond that of clinical variables by up to 5%. Such prognostic models could be useful for identifying people with diabetes at high risk of progressing quickly to treatment intensification. Data availability: Summary statistics of lipidomic, proteomic and metabolomic data are available from a Shiny dashboard at https://rhapdata-app.vital-it.ch.
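The C statistic used to evaluate these models can be illustrated with a small worked example. The following is a minimal sketch of a concordance index for right-censored data, not the code used in the study: the function name `harrell_c`, the toy follow-up data, and the choice to skip pairs with tied follow-up times are all assumptions made for illustration.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Concordance index for right-censored data (illustrative sketch).

    times:       follow-up time per individual
    events:      1 if the event (e.g. insulin requirement) was observed,
                 0 if the individual was censored
    risk_scores: model output, where higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # Assumption: a pair is comparable only if the times differ and
        # the shorter time ends in an observed event, not in censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four individuals, one of them censored.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_c = harrell_c(toy_times, toy_events, toy_scores)
```

In the toy data, five pairs are comparable and four of them are ranked correctly, so the index is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported values of 0.71 to 0.74 in context.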

Consequences and opportunities arising due to sparser single-cell RNA-seq datasets

Journal article (2023) - Gerard A. Bouland, Ahmed Mahfouz, Marcel J.T. Reinders

With the number of cells measured in single-cell RNA sequencing (scRNA-seq) datasets increasing exponentially, and sparsity increasing concurrently as more zero counts are measured for many genes, we demonstrate here that downstream analyses on binarized gene expression give results similar to those of count-based analyses. Moreover, with a binary representation, ~50-fold more cells can be analyzed using the same computational resources. We also highlight the possibilities provided by binarized scRNA-seq data. Development of specialized tools for bit-aware implementations of downstream analytical tasks will enable a more fine-grained resolution of biological heterogeneity. ...
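To make the idea of a bit-aware representation concrete, the sketch below packs one gene's binarized expression across cells into a single integer used as a bitset, then compares two genes with bitwise operations. This is an illustration under stated assumptions and not the implementation from the article: the function names, the toy counts, and the use of Jaccard similarity as the example downstream task are hypothetical.

```python
def pack_binarized(counts):
    """Pack one gene's counts across cells into an int used as a bitset.

    Bit k is set when the gene has a non-zero count in cell k.
    """
    bits = 0
    for k, c in enumerate(counts):
        if c > 0:
            bits |= 1 << k
    return bits


def popcount(bits):
    """Number of set bits, i.e. number of cells expressing the gene."""
    return bin(bits).count("1")


def jaccard(bits_a, bits_b):
    """Share of cells expressing both genes among cells expressing either."""
    union = popcount(bits_a | bits_b)
    return popcount(bits_a & bits_b) / union if union else 0.0


# Hypothetical counts for two genes measured in the same eight cells.
gene_a_counts = [0, 3, 0, 1, 7, 0, 0, 2]
gene_b_counts = [0, 1, 0, 0, 4, 0, 5, 0]

gene_a_bits = pack_binarized(gene_a_counts)
gene_b_bits = pack_binarized(gene_b_counts)
co_expressing_cells = popcount(gene_a_bits & gene_b_bits)
similarity = jaccard(gene_a_bits, gene_b_bits)
```

Each cell costs one bit per gene instead of a full integer count, and set operations over many cells reduce to a handful of bitwise instructions. The memory saving achieved in practice depends on how the original counts were stored.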

Identifying Aging and Alzheimer Disease–Associated Somatic Variations in Excitatory Neurons From the Human Frontal Cortex

Journal article (2023) - M. Zhang, G.A. Bouland, H. Holstege, M.J.T. Reinders

Background and Objectives: With age, somatic mutations accumulated in human brain cells can lead to various neurologic disorders and brain tumors. Because the incidence rate of Alzheimer disease (AD) increases exponentially with age, investigating the association between AD and the accumulation of somatic mutation can help understand the etiology of AD. Methods: We designed a somatic mutation detection workflow by contrasting genotypes derived from whole-genome sequencing (WGS) data with genotypes derived from scRNA-seq data and applied this workflow to 76 participants from the Religious Order Study and the Rush Memory and Aging Project (ROSMAP) cohort. We focused only on excitatory neurons, the dominant cell type in the scRNA-seq data. Results: We identified 196 sites that harbored at least 1 individual with an excitatory neuron–specific somatic mutation (ENSM), and these 196 sites were mapped to 127 genes. The single base substitution (SBS) pattern of the putative ENSMs was best explained by signature SBS5 from the Catalogue of Somatic Mutations in Cancer (COSMIC) mutational signatures, a clock-like pattern correlating with the age of the individual. The count of ENSMs per individual also showed an increasing trend with age. Among the mutated sites, we found that 2 sites tended to have more mutations in older individuals (16:6899517 [RBFOX1], p = 0.04; 4:21788463 [KCNIP4], p < 0.05). In addition, 2 sites were found to have a higher odds ratio to detect a somatic mutation in AD samples (6:73374221 [KCNQ5], p = 0.01 and 13:36667102 [DCLK1], p = 0.02). Thirty-two genes that harbor somatic mutations unique to AD and the KCNQ5 and DCLK1 genes were used for gene ontology (GO)–term enrichment analysis. We found the AD-specific ENSMs enriched in the GO terms “vocalization behavior” and “intraspecies interaction between organisms.” Of interest, we observed both age-specific and AD-specific ENSMs enriched in the K⁺ channel–associated genes. Discussion: Our results show that combining scRNA-seq and WGS data can successfully detect putative somatic mutations. The putative somatic mutations detected from the ROSMAP data set have provided new insights into the association of AD and aging with brain somatic mutagenesis.

Differential analysis of binarized single-cell RNA sequencing data captures biological variation

Journal article (2021) - Gerard A. Bouland, Ahmed Mahfouz, Marcel J.T. Reinders

Single-cell RNA sequencing data is characterized by a large number of zero counts, yet there is growing evidence that these zeros reflect biological variation rather than technical artifacts. We propose to use binarized expression profiles to identify the effects of biological variation in single-cell RNA sequencing data. Using 16 publicly available and simulated datasets, we show that a binarized representation of single-cell expression data accurately represents biological variation and reveals the relative abundance of transcripts more robustly than counts. ...
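As a rough illustration of what a differential analysis on binarized data can look like, the sketch below compares the fraction of cells in which a gene is detected between two groups. It swaps in a simple two-proportion z-test and should not be read as the statistical method of the article; the function names and the toy counts are assumptions.

```python
import math


def detection_rate(counts):
    """Fraction of cells in which the gene has a non-zero count."""
    return sum(1 for c in counts if c > 0) / len(counts)


def two_proportion_z(counts_a, counts_b):
    """Two-sided z-test on the difference in detection rates.

    Only whether a count is zero or non-zero is used, so the magnitude
    of the counts has no influence on the result.
    """
    n_a, n_b = len(counts_a), len(counts_b)
    k_a = sum(1 for c in counts_a if c > 0)
    k_b = sum(1 for c in counts_b if c > 0)
    pooled = (k_a + k_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Gene detected in all cells or in none: no difference to test.
        return 0.0, 1.0
    z = (k_a / n_a - k_b / n_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return z, p


# Hypothetical counts of one gene in ten cells from each of two groups.
group_a = [5, 2, 0, 1, 9, 3, 0, 4, 1, 2]   # detected in 8 of 10 cells
group_b = [0, 0, 1, 0, 0, 0, 6, 0, 0, 0]   # detected in 2 of 10 cells

rate_a = detection_rate(group_a)
rate_b = detection_rate(group_b)
z_score, p_value = two_proportion_z(group_a, group_b)
```

With only ten cells per group the normal approximation is crude; an exact test or a regression model with covariates would be a more careful choice on real data.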