Thomas Abeel | TU Delft Repository

Learning to Learn from Microbiome Data

Benchmarking Meta-Learning for Disease Classification on Microbiome Abundance Data

Master thesis (2025) - S. Ramezani (author) , Thomas Abeel (mentor) , C. Peng (mentor) , B.M. Cosma (mentor) , Jasmijn A. Baaijens (mentor) , C. Lofi (graduation committee member)

The human gut microbiome has emerged as a key player in health and disease, yet machine learning on microbiome data remains challenging due to its high dimensionality, sparsity, compositionality, and inter-study heterogeneity. Although classical and deep learning methods have demonstrated promise, they often require extensive labeled data, which is rarely available in microbiome research. In this thesis, we investigate whether meta-learning can address these challenges by enabling better generalization from small, heterogeneous microbiome datasets. Specifically, we benchmark Prototypical networks (Protonets), a metric-based, few-shot meta-learning algorithm, against strong classical baselines (Random Forests, XGBoost, and Multi-layer Perceptrons) for disease classification tasks across a selected number of gut microbiome studies. We introduce a unified benchmarking pipeline that standardizes preprocessing, dimensionality reduction, task construction, and evaluation across studies. A leave-one-study-out cross-validation strategy simulates realistic deployment scenarios where only a few labeled samples are available from a new cohort. Our experiments explore the impact of support set size and dimensionality reduction via principal component analysis. Results show that although Protonets offer a conceptually appealing approach for few-shot learning, they consistently underperform compared to Random Forests in classification accuracy. Statistical analyses confirm the significance of this performance gap, and embedding visualizations reveal limited class separation in the learned feature space. These findings suggest that, under the evaluated conditions, classical models like Random Forests remain the more robust choice for microbiome classification in low-data regimes. 
By offering a rigorous and reproducible evaluation, this work lays the foundation for further exploration of meta-learning in microbiome research and highlights both the potential and current limitations of learning to learn in this complex domain.
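The core idea behind the benchmarked method can be sketched in a few lines. This is a minimal illustration, not the thesis pipeline. It assumes an identity embedding (a real Protonet learns the embedding on the training studies), and the names `class_prototypes`, `predict`, `few_shot_accuracy`, `leave_one_study_out`, and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def class_prototypes(support):
    """Return the mean feature vector per class from (features, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for features, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + x for s, x in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(prototypes, features):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(prototypes[label], features))


def few_shot_accuracy(samples, shots):
    """Use the first `shots` samples per class as support, the rest as queries."""
    seen = defaultdict(int)
    support, query = [], []
    for features, label in samples:
        if seen[label] < shots:
            support.append((features, label))
            seen[label] += 1
        else:
            query.append((features, label))
    prototypes = class_prototypes(support)
    correct = sum(predict(prototypes, f) == y for f, y in query)
    return correct / len(query)


def leave_one_study_out(studies):
    """Yield (held_out_name, remaining_studies) for each study in turn."""
    for name in studies:
        yield name, {k: v for k, v in studies.items() if k != name}


# Toy abundance-like vectors (assumed data, two features per sample).
toy_studies = {
    "study_a": [([0.9, 0.1], "disease"), ([0.1, 0.9], "healthy"),
                ([0.8, 0.2], "disease"), ([0.2, 0.8], "healthy")],
    "study_b": [([0.7, 0.3], "disease"), ([0.3, 0.7], "healthy"),
                ([0.6, 0.4], "disease"), ([0.4, 0.6], "healthy")],
}

for held_out, training in leave_one_study_out(toy_studies):
    # `training` would be used to meta-train the embedding; unused in this sketch.
    print(held_out, few_shot_accuracy(toy_studies[held_out], shots=1))
```

In this sketch the support set size is the `shots` argument, which corresponds to the number of labeled samples per class assumed to be available from the new cohort.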

Embedding-Based Multi-Paradigm Protein Function Prediction in Prokaryotes

Master thesis (2025) - M. Hielkema (author) , T.E.P.M.F. Abeel (mentor)

In the past decade, protein functional prediction has dramatically shifted towards the usage of large language models (LLMs). In this research, we set out to improve upon the model of SAFPred, a model for prokaryote protein function prediction combining LLM embedding based sequen ...

A metagenomic investigation of antibiotic resistance in non-clinical environments

Doctoral thesis (2025) - S. Pillay (author) , Marcel JT Reinders (promotor) , TEPMF Abeel (promotor)

Antimicrobial resistance (AMR), termed a "silent pandemic", caused 4.95 million deaths in 2019, with numbers expected to rise. AMR spans human, animal, and environmental sectors, requiring a One Health approach to address this multifaceted global challenge. This dissertation focuses on the under-represented non-clinical sectors and uses metagenomic data to advance AMR research.

The primary focus in AMR research has been on clinical settings, overlooking animals and the environment and leaving data gaps in resource-limited regions. The world of AMR and metagenomic data is first introduced, followed by an in-depth review of AMR in non-clinical sectors and the information metagenomic data can provide. The emphasis is on bioinformatic tools, databases, and workflows to support researchers utilising metagenomic data for AMR studies in these sectors.

Next, the wastewater treatment process, including the neglected upstream and downstream freshwater systems, is examined to assess the microbiome, resistome and mobilome at each stage. Specific differences within every wastewater treatment plant process sector and their role in AMR transmission are identified. Inspired by the natural baseline of antibiotic resistance in soil, a comparative study of the composition of the microbiome, resistome and mobilome in different soil types, from natural to rural soils, is then presented. Given the limited information on resistance patterns and the effects of geographical and anthropogenic factors, the influence of antibiotic resistance in different soil types is explored further.

The swine industry, as the largest consumer of antibiotics, raises concerns about the effects of antibiotic use on the gut microbiome of animals. Antibiotics can impact animal health and promote the transmission of AMR to other non-clinical sectors and humans. How antibiotic use affects the fecal microbiome of pigs raised with and without antibiotics is examined to understand the dynamics of antibiotic resistance in the swine industry.

The burden of AMR, particularly in low- and middle-income countries, where resources for infectious disease surveillance are limited, was the inspiration to propose a method for generating metagenomic data in-field and in resource-limited settings, offering a cost-effective solution for outbreak monitoring and pathogen detection.

The main goal of this dissertation is to highlight the under-represented sectors and their significant role in AMR, and to promote global inclusivity.

“From iMage to Market”: Machine-Learning-Empowered Fruit Supply

Doctoral thesis (2025) - J. Wen (author) , Mathijs de Weerdt (promotor) , Thomas Abeel (promotor)

Artificial intelligence (AI) has become a widely discussed and transformative technology, with its adoption growing across industries to drive insights and impact. In this thesis, we explore how AI methods and algorithms can facilitate the operation of soft-fruit supply chains, using strawberries as a case study.

The thesis begins by presenting the general background and various perspectives from related works on how AI and machine learning (ML) have been applied to address problems in agricultural or horticultural practices. This includes tasks that, while not directly optimizing supply strategies, still contribute to solving broader challenges. In a nutshell, this thesis categorizes the scope of study into three scales: the single-fruit scale, the greenhouse scale, and the market scale. Within each scale, we review the existing research, identify knowledge gaps, and introduce robust and applicable methodologies capable of dealing with real-world conditions.

Since no publicly available datasets met the requirements of the research plan, we established several datasets for research on the soft-fruit supply chain through collecting, annotating, and (pre-)processing data. These newly curated datasets not only support the research presented in this thesis but also lay a foundation for future research from various perspectives. Details about these datasets are introduced in Chapter 2. Moreover, we conceptualize the process of gathering longitudinal observations from growth monitoring images as a multiple object tracking (MOT) task. We named the image collection and its MOT annotations “The Growing Strawberries” (GSD). The computer vision challenges that GSD brings are further benchmarked and discussed in Chapter 3. Following this, the core contributions of the thesis are presented in Chapters 3 to 6, each corresponding to a published paper or one currently under review. Finally, Chapter 7 summarizes the research findings, answering the research questions proposed in Chapter 1 and discussing the overall work of the thesis.

We discuss these contributions for each of the three mentioned scales separately:

At the fruit scale, we designed and analyzed novel methodologies to track fruit growth and to predict key properties, including both external characteristics like ripeness and internal qualities such as sweetness. For ripeness, we propose to use appearance properties, mainly the hue, as an objective metric to quantify it. For sweetness, we trained deep neural networks to perform non-destructive prediction using environmental and image data, individually and in combination.
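As a rough illustration of a hue-based appearance measure, the sketch below computes the average hue of a set of fruit pixels. It is an assumption about how such a metric could be computed, not the method used in the thesis. The function name `mean_hue_degrees` is hypothetical, and a circular mean is used because red hues wrap around the 0/360 degree boundary.

```python
import colorsys
import math


def mean_hue_degrees(pixels):
    """Circular mean hue in degrees [0, 360) of (r, g, b) pixels with 0-255 channels."""
    sin_sum = cos_sum = 0.0
    for r, g, b in pixels:
        hue, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        angle = 2 * math.pi * hue  # hue is returned in [0, 1)
        sin_sum += math.sin(angle)
        cos_sum += math.cos(angle)
    # atan2 recovers the mean direction; modulo maps it back to [0, 360).
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360


# Assumed example pixels: two reddish tones on either side of the wrap-around.
reddish = [(255, 0, 10), (255, 10, 0)]
print(mean_hue_degrees(reddish))
```

An arithmetic mean of the two hues above would land near 180 degrees (cyan), which is why the circular mean is the safer choice for fruit that ripens toward red.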

Our use of color analysis and ML models provides a non-destructive and generalizable approach that ensures consistency when upstream and downstream parties in a supply chain estimate the properties of fruits. Meanwhile, the models perform comparably to laboratory benchmarks even under imperfect, outdoor data collection. We also demonstrated the model in a mobile app to facilitate adoption in the field.

By benchmarking state-of-the-art MOT algorithms on GSD, we illustrated the new challenges that are brought by this use case: first, the MOT objects change appearance during tracking due to their biological development, and second, sparse frame rates introduce irregular movements from image to image. We showcased how fruit properties, such as ripeness, change over the fruit's life cycle. The results not only provide quantitative measurements that describe the fruit's biological development, but also depict the pain points of current MOT algorithms' predictions. In the meantime, by quantifying these changes over the biological development, we also retrieve relevant information and datasets to support predictions of the changes.

At the greenhouse scale, we designed a framework that optimizes the timing of fruit harvesting by integrating the aforementioned quantified changes over biological development, based on sequential demands about the desired quantities to be harvested. Essentially, the framework makes fruit-specific decisions on harvest dates by leveraging the monitoring data. The decisions are thus made to enhance both current and future demand-fulfillment capabilities. At each stage of this framework, we evaluated various methods and discussed their effectiveness in achieving the stage targets. Examples include how to process the in-field data into coherent functions describing ripeness development, how to predict future changes, and how to include different perspectives in the optimization model. As the decisions are made for each specific fruit, the work also demonstrates significant potential for integration with mobile apps and harvesting robots. On top of that, the information retrieval function can also serve as a standalone application to provide objective fruit-level quality assessment.

At the market scale, we focus on the portfolio optimization of a grower under a widely applied mechanism of the market system: the majority of demands for harvests are predetermined through advance contracts, which also serves as an a priori condition of the solution proposed at the greenhouse scale. The local market, with dynamic prices and demands, can be used to offset losses arising from the difference between contracted demands and the actual yield. To mitigate outlying decision failures, we introduced the “smart predict-then-optimize” (SPO) method, which trains models to predict future yield and local market prices.
Our results illustrate that SPO loss primarily affects the bias layer in neural networks, contrasting with models trained using mean squared error (MSE). This difference essentially leads to more conservative estimations in decision-making scenarios, and also motivates and highlights the importance of effective MSE-based pre-training. Additionally, our study reveals how SPO loss makes models interact when multiple neural networks are trained to predict decision parameters with diverse functions. This insight expands the applicability of SPO loss across a broader range of use cases and model architectures, underscoring its contribution to the field of decision-focused learning.

In conclusion, this thesis introduces diverse data-driven methodologies to tackle the distinct tasks involved in optimizing fruit supply, using strawberries as a case study. Central to our approach is the effective utilization of data, which serves as the foundation for solutions that span from fruit-level evaluations to market-level planning. By leveraging analytics of non-destructive data, our solutions provide objective estimations of fruit quality, fostering a more consistent shared understanding between sellers and buyers while reducing potential food waste. Overall, these advancements push the boundaries of AI in supporting decision-making during the supply of soft fruits, particularly for smaller growers. The findings not only empower more efficient and sustainable supply chain operations but also highlight the strong potential for many practical real-world applications.

Characterizing bacterial genetic diversity

In species' pangenomes and microbial communities

Doctoral thesis (2025) - L.R. van Dijk (author) , Marcel J.T. Reinders (promotor) , TEPMF Abeel (promotor)

Bacteria are everywhere and play essential roles in Earth's diverse ecosystems and human health. For example, humans harbor a complex and essential gut microbial community comprising thousands of bacterial species (in addition to numerous viruses, fungi, and microbial eukaryotes). This community helps break down and synthesize nutrients, trains the immune system, and keeps pathogens at bay. However, imbalances in this community are associated with several diseases, including obesity, inflammatory bowel syndrome, and recurrent urinary tract infections. Moreover, bacteria can cause deadly infections, and many are developing resistance to our most potent antibiotics. Studying bacteria is thus essential for identifying differences between pathogens and harmless commensals, countering antimicrobial resistance, and understanding their impacts on human health.

To study bacteria, we typically characterize and compare their genomes. The genome comprises all of an organism's hereditary information, and its genes encode the molecular machines necessary for cell function, providing an overview of an organism's capabilities. The hereditary information in genomes additionally enables inferring the organism's evolutionary history, which helps in understanding why specific traits evolved or aids in inferring transmission links in case of an outbreak.

A challenge with comparing large sets of bacterial genomes is the extensive variation in genome content among many species. For example, two Escherichia coli strains can share as little as 50% of their genes. Current computational tools offer biased or incomplete views of genetic variation among strains. This hinders the identification of genotype-phenotype associations, prevents tracking mobile genetic elements, and limits our understanding of the microbial communities they are part of.
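To make the notion of shared gene content concrete, the sketch below computes the fraction of one strain's genes that are also present in another. This is only one possible definition of gene sharing, chosen for illustration. The function name `shared_gene_fraction` and the gene identifiers are hypothetical, and real comparisons first require clustering genes into families, which is not shown.

```python
def shared_gene_fraction(genes_a, genes_b):
    """Fraction of the distinct genes in strain A that also occur in strain B."""
    a, b = set(genes_a), set(genes_b)
    if not a:
        return 0.0  # avoid division by zero for an empty gene set
    return len(a & b) / len(a)


# Hypothetical gene family identifiers for two strains.
strain_a = ["g1", "g2", "g3", "g4"]
strain_b = ["g3", "g4", "g5", "g6"]
print(shared_gene_fraction(strain_a, strain_b))
```

Note that this measure is asymmetric when the two strains carry different numbers of genes, so the direction of comparison should be stated when reporting it.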

The central question of this thesis is how to design computational tools that enable accurate characterization of genetic variation among diverse bacterial genomes. This thesis introduces new algorithms to identify and represent genetic variation using graph data structures. It additionally presents a tool that characterizes strain-specific genetic variation in microbial communities, even in the presence of same-species strain mixtures. Finally, this thesis uses the previously mentioned tools to investigate the role of the gut microbiome in women with recurrent urinary tract infections, offering novel insights into the gut and bladder dynamics of E. coli.

Collectively, we expect this work to contribute to an improved mechanistic understanding of bacteria's role in human health, help track and counter the spread of antimicrobial resistance, and inform on the development of microbiome-mediated therapeutics.

Modified GNN-SubNet: leveraging local versus global Graph Neural Network explanations for disease subnetwork detection

Bachelor thesis (2024) - E. Milchi (author) , M. Khosla (mentor) , Jana M. Weber (mentor) , Thomas Abeel (coach)

As graph neural networks (GNNs) become more frequently used in the biomedical field, there is a growing need to provide insight into how their predictions are made. An algorithm that does this is GNN-SubNet, developed with the aim of detecting disease subnetworks in protein-prote ...

The artificially generated microbiome

A study on the generation and potential use cases of predicted meta-omics data

Master thesis (2024) - B.M. Cosma (author) , T.E.P.M.F. Abeel (mentor) , Gosia Migut (graduation committee member) , Stephanie Pillay (mentor) , David Calderón-Franco (mentor)

Motivation: Imbalances in the human gut microbiome have been linked to various conditions, including inflammatory bowel disease (IBD), diabetes, and mental health disorders. While metagenomics and amplicon sequencing are the most commonly used technologies to characterize ...

Influence of molecular structures on graph neural network explainers' performance

Bachelor thesis (2024) - T.N. Stols (author) , M. Khosla (mentor) , Jana M. Weber (mentor) , Thomas Abeel (coach)

This study evaluates how the explainer for a Graph Neural Network creates explanations for chemical property prediction tasks. Explanations are masks over input molecules that indicate the importance of atoms and bonds toward the model output. Although these explainers have bee ...

Analysis of the effect of conserved regions on bacterial plasmid host range

Master thesis (2024) - C.K. Schilder (author) , T.E.P.M.F. Abeel (mentor) , Marco Teixeira (graduation committee member)

Background: Genetic information is shared between different bacteria through mobile genetic elements, including plasmids. Some plasmids are able to transfer and spread genetic information between different species. Understanding which genes allow plasmids to replicate in differ ...

Optimizing strains in Metabolic Engineering: comparative analysis of β-Conditional Variational Auto-encoder and Probabilistic PCA for synthetic data generation

Bachelor thesis (2024) - U.D. Kirbeyi (author) , T.E.P.M.F. Abeel (mentor) , Paul van Lent (mentor) , A Hanjalic (graduation committee member)

This research explores the landscape of dataset generation through the lens of Probabilistic Principal Component Analysis (PPCA) and β-Conditional Variational Auto-encoder (β-CVAE) models. We conduct a comparative analysis of their respective capabilities in reproducing datasets ...

Synthetic data generation for the optimization of strains in metabolic engineering using latent space representations derived from a Conditional Variational Autoencoder

Bachelor thesis (2024) - N.M. Alwani (author) , T.E.P.M.F. Abeel (mentor) , P.H. van Lent (mentor) , Alan Hanjalic (graduation committee member)

This study investigates the application of generative models for synthetic data generation in pathway optimization experiments within the field of metabolic engineering. Conditional Variational Autoencoders (CVAEs) use neural networks and latent variable distributions to generate ...

Synthetic data generation for the optimization of strains in metabolic engineering using generative adversarial networks

Bachelor thesis (2024) - M.W. Jarosz (author) , T.E.P.M.F. Abeel (mentor) , Paul van Lent (mentor) , Alan Hanjalic (graduation committee member)

This research investigates the application of Generative Adversarial Networks (GANs) and probabilistic Principal Component Analysis (PPCA) in generating synthetic data for pathway optimization in metabolic engineering. The study aims to compare the performance of these generative ...

The Microbial World Magnified Through the Bright Lens of Comparative Genomics

Doctoral thesis (2024) - A. Urhan (author) , Marcel J.T. Reinders (promotor) , TEPMF Abeel (promotor)

We are witnessing an era of rapid technological advancements, which have led to an explosion in the amount of genomic data collected. The field of comparative genomics, in parallel, is expanding at an unrelenting rate. Comparative genomics explores the similarities and differences in the genomes of various organisms, species or strains, and it is one of our most useful tools today for unraveling the complexities of microbial biology. However, despite growing interest in microbial genomics, there remains a significant gap in our understanding of microbial diversity and function. The microbial dark matter remains elusive, and we have a lot more to uncover.
This dissertation aims to leverage comparative genomics, and develop novel algorithms tailored for microbial genomes to enhance our understanding of microbial biology and address existing knowledge gaps. More specifically, it focuses on the representation of microbial diversity and the functional annotation of poorly characterized taxa. By harnessing large-scale genomic datasets, novel approaches and algorithms are designed to uncover hidden traits in microorganisms.
We begin our journey at the smallest scale with viruses; our study of SARS-CoV-2 genomes in the Netherlands during the COVID-19 pandemic showcases the power of genomic data to understand disease dynamics. The remainder of the dissertation concerns bacteria. We explore pangenome graphs to represent bacterial populations. As I discuss the limitations of current methods, I propose an ensemble approach to exploit graph representations for structural variant calling. This work sets the stage for future developments in pangenome graphs as a powerful framework to model bacterial populations and analyze their genetic makeup.
Following recent developments in algorithms for eukaryotes, I draw inspiration from natural language processing to predict gene functions in bacteria. I present SAFPred, a novel tool in which I integrate bacterial synteny into the predictive model, and demonstrate its use to identify variants of toxin genes in Enterococcus. The novelty of my approach lies partly in how I incorporated bacterial synteny into the function prediction algorithm. Thus, I also release our synteny database, SAFPredDB, that can facilitate various comparative genomic analyses in the future. Our journey comes to an end in our study of the Enterococcus genus through the largest collection of genome assemblies. Here, I emphasize the importance of understanding microbial diversity and antibiotic resistance mechanisms once again, and note the power of large scale genomic analyses.
Overall, my main goal with this dissertation is to showcase the potential of comparative genomics in unraveling the mysteries of microbial life and addressing pressing global challenges in health, agriculture, and biotechnology. Through innovative methods and large-scale data analysis, my work, first and foremost, offers valuable insights into microbial biology and evolution, paving the way for future research in the field. And I hope it also encourages further exploration and appreciation of the mighty world of microbes.

Finding biological markers for Parkinson's disease

Using machine learning to analyse metagenomic data

Bachelor thesis (2023) - M.L. Koning (author) , E.A. van der Toorn (mentor) , D. Calderón-Franco (mentor) , T.E.P.M.F. Abeel (mentor) , T. Höllt (graduation committee member)

Parkinson's disease (PD) is a neurodegenerative disorder characterized by motor function loss and potential mental and behavioral changes. The identification of biomarkers in the gut microbiota of PD patients can significantly aid in fast and accurate diagnosis. This study invest ...

Identifying biological markers in the gut microbiome associated with celiac disease using machine learning

Bachelor thesis (2023) - P. Persianov (author) , T.E.P.M.F. Abeel (mentor) , E.A. van der Toorn (mentor) , David Calderón-Franco (mentor) , Thomas Höllt (graduation committee member)

Celiac disease is a genetic autoimmune disorder caused by a negative reaction to gluten associated with alterations in the gut microbiome. This study explored the potential of machine learning models and feature selection methods in identifying biomarkers for celiac disease using ...

Finding biological markers for the prediction of colorectal cancer

Using machine learning methods to identify functional biomarkers in the human gut microbiome

Bachelor thesis (2023) - A.J.G. Sloof (author) , T.E.P.M.F. Abeel (mentor) , E.A. van der Toorn (mentor) , David Calderón-Franco (mentor) , Thomas Höllt (graduation committee member)

Colorectal cancer (CRC), one of the leading causes of mortality, is challenging to diagnose. By using metagenomic analysis with machine learning methods, this can be done in a non-invasive manner. In this research, a neural network has been trained on relative pathway abundance d ...

Finding Biomarkers for Type 2 Diabetes

Bachelor thesis (2023) - A. Das (author) , T.E.P.M.F. Abeel (mentor) , E.A. van der Toorn (mentor) , D. Calderón-Franco (mentor) , T. Höllt (graduation committee member)

Type 2 Diabetes is a very prevalent disease in current times and leads to significant adverse effects. Recently, there has been a growing interest in the association of the human gut microbiome with respect to chronic diseases like Type 2 Diabetes with the aim to identify biomark ...

Finding Biomarkers for Schizophrenia

Can Machine Learning algorithms identify schizophrenia-related biomarkers within metagenomic data derived from the human gut microbiome?

Bachelor thesis (2023) - T.M. Bastow (author) , E.A. van der Toorn (mentor) , D. Calderón-Franco (mentor) , T.E.P.M.F. Abeel (mentor) , T. Höllt (graduation committee member)

There is mounting evidence indicating a relationship between the gut microbiome composition and the development of mental diseases but the mechanisms remain unclear. Shotgun sequenced data from 90 schizophrenic patients and 81 sex, age, weight, and location matched controls w ...

Using metagenomic Hi-C data to discover broad host-range plasmids conferring antimicrobial resistance

Master thesis (2023) - E. Dorrestijn (author) , Thomas Abeel (mentor) , Stephanie Pillay (graduation committee member) , M. Skrodzki (graduation committee member)

Horizontal gene transfer (HGT) through plasmids is one of the main contributors to the rapid increase of antimicrobial resistance (AMR). Studying wastewater from wastewater treatment plants (WWTPs) allows us new insights into HGT as bacteria from different sources come together. C ...

Explainable AI for metabolic engineering

Master thesis (2023) - T.R.D. van Graft (author) , Thomas Abeel (mentor) , Paul van Lent (coach)

Metabolic engineering is an important field in biotechnology, aimed at optimizing cellular processes to produce desired compounds. In this thesis, we focus on predicting the metabolome from the proteome, as understanding this relationship is crucial for understanding cellular met ...