Thomas Abeel
61 records found
Predicted meta-omics
A potential solution to multi-omics data scarcity in microbiome studies
AILMENT
A novel ML framework for prediction and analysis of microbiota associations in colorectal cancer
Although biofilms are widespread in nature, the ecological roles and compositional diversity of the extracellular polymeric substances (EPS) forming these structures remain poorly understood. Here, we apply a bottom-up genomic approach by investigating the biosynthetic potential for glycan precursors in the genus “Candidatus Accumulibacter”, with a focus on assessing the intra-genus variability. Within a curated set of 61 “Ca. Accumulibacter” MAGs, our analysis revealed a dichotomy in glycan precursors between a conserved core group of 9 nucleotide-sugars and a variable accessory set of 12 nucleotide-sugars, out of 50 nucleotide-sugars tested. The core nucleotide-sugars in “Ca. Accumulibacter” are related to nucleotide-sugars also found to be widely distributed across the tree of life, whereas the accessory set is enriched in rare nucleotide-sugars. The accessory nucleotide-sugars show an irregular distribution across “Ca. Accumulibacter” phylogeny, and divergent evolutionary histories. This highlights the possibility that distinct evolutionary pressures act on different parts of the EPS-formation metabolism, leading to genotypic diversification driven by complex biological phenomena such as horizontal gene transfer that support the observed divergent evolutionary histories.
Circling in on plasmids
Benchmarking plasmid detection and reconstruction tools for short-read data from diverse species
The ability to detect and reconstruct plasmids from genome assemblies is crucial for studying the evolution and spread of antimicrobial resistance and virulence in bacteria. Though long-read sequencing technologies have made reconstructing plasmids easier, most (97%) of the bacterial genome assemblies in the public domain are generated from short-read data. Work to compare plasmid reconstruction tools has focused primarily on Escherichia coli, leaving gaps in our understanding of how well these tools perform on other, less well-characterized, taxa. Using high-quality assemblies as ground truth, we benchmarked 12 plasmid detection tools (which identify plasmid contigs in assemblies) and four plasmid reconstruction tools (which group contigs from the same plasmid together). We tested their ability to characterize diverse plasmids from short-read assemblies representing a wide range of Enterobacterales and Enterococcus species, including newly discovered and poorly characterized species collected from nonhuman hosts. Plasmer, PlasmidEC, PlaScope, and gplas2 were the highest-scoring plasmid detection tools, performing well for both Enterobacterales and enterococci. The two major determinants of accurate plasmid detection were representation in plasmid databases—with Enterobacterales plasmids being more easily detected than those from enterococci—and assembly contiguity, which was also key for successful plasmid reconstruction. Gplas2 performed best for plasmid reconstruction; however, less than half of plasmids were perfectly reconstructed, suggesting that substantial room for improvement remains in this class of tools.
Smart “predict, then optimize” (SPO) (Elmachtoub in Manag Sci 68(1): 9–26, 2022) is an end-to-end learning strategy for models that predict parameters in optimization problems. Unlike minimizing mean squared error (MSE), which targets prediction accuracy, SPO aims to ensure that predictions lead to the best possible decisions. The associated loss function, termed SPO loss, measures the regret of a decision relative to the optimal outcome under the realized parameters. Existing literature has demonstrated the viability of SPO; however, these studies often focus on classical optimization problems and employ a limited set of models for benchmarking. In this study, we tackled a decision-making task inspired by real-world challenges across a wide range of neural network models. Unlike classical problems, our task requires a unique approach: collaboratively training two models to predict different variables. On top of that, one of the decision variables also affects the feasibility of the decisions, further increasing the complexity. While our implementation validates the benefits of SPO, we were surprised to find that models trained exclusively on SPO loss do not consistently attain the minimum regret. Our further investigation into hyperparameters illustrates that the well-tuned models learned very similar patterns from the feature set, irrespective of whether MSE or SPO loss was used. In other words, the change from MSE to SPO loss in training primarily affected the layer biases. Therefore, to improve the learning efficacy with SPO loss, we propose prioritizing learning feature patterns as the fundamental step. Possible strategies include using specialized neural network layers to capture deeper patterns more effectively or simply warming up by training with MSE. Specifically, a warming-up process is particularly advantageous for model(s) where the outputs are closely tied to constraints, as their prediction accuracy significantly impacts the decision feasibility.
The insights are investigated empirically through two real-world trading scenarios. By leveraging datasets with diverse properties, we demonstrate the novelty and generalizability of our investigation.
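The difference between MSE and regret can be illustrated with a minimal sketch. It assumes a toy problem (pick the k cheapest items) rather than the trading task studied in the paper, and all function names and numbers are hypothetical. It only shows that a prediction with a large MSE can still yield zero regret, while a prediction with a small MSE can lead to a worse decision.

```python
def best_decision(costs, k):
    """Toy optimization problem (an assumption): pick the k cheapest items."""
    return sorted(range(len(costs)), key=lambda i: costs[i])[:k]


def decision_cost(decision, true_costs):
    """Cost actually paid for a decision once the true costs are realized."""
    return sum(true_costs[i] for i in decision)


def spo_regret(predicted, true_costs, k):
    """Regret: cost of the decision made from predictions minus the optimal cost."""
    chosen = best_decision(predicted, k)
    optimal = best_decision(true_costs, k)
    return decision_cost(chosen, true_costs) - decision_cost(optimal, true_costs)


def mse(predicted, true_costs):
    """Mean squared prediction error, ignoring the downstream decision."""
    return sum((p - t) ** 2 for p, t in zip(predicted, true_costs)) / len(true_costs)


true_costs = [1.0, 2.0, 3.0, 4.0]
pred_a = [11.0, 12.0, 13.0, 14.0]  # large error everywhere, ranking preserved
pred_b = [1.0, 3.1, 3.0, 4.0]      # small error, but items 1 and 2 swap rank

for name, pred in [("a", pred_a), ("b", pred_b)]:
    print(name, "MSE:", mse(pred, true_costs), "regret:", spo_regret(pred, true_costs, k=2))
```

Here `pred_a` is far off in absolute terms but preserves the ranking, so its decision is optimal; `pred_b` is nearly exact yet picks a more expensive item.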
Partial order alignment is a widely used method for computing multiple sequence alignments, with applications in genome assembly and pangenomics, among many others. Current algorithms to compute the optimal, gap-affine partial order alignment do not scale well to larger graphs and sequences. While heuristic approaches exist, they do not guarantee optimal alignment and sacrifice alignment accuracy.
Results
We present POASTA, a new optimal algorithm for partial order alignment that exploits long stretches of matching sequence between the graph and a query. We benchmarked POASTA against the state-of-the-art on several diverse bacterial gene datasets and demonstrated an average speed-up of 4.1x and up to 9.8x, using less memory. POASTA’s memory scaling characteristics enabled the construction of much larger POA graphs than previously possible, as demonstrated by megabase-length alignments of 342 Mycobacterium tuberculosis sequences.
Jaxkineticmodel
Neural ordinary differential equations inspired parameterization of kinetic models
Motivation: Metabolic kinetic models are widely used to model biological systems. Despite their widespread use, it remains challenging to parameterize the ordinary differential equations (ODEs) of large-scale kinetic models. Recent work on neural ODEs has shown the potential for modeling time-series data using neural networks, and many methodological developments in this field can similarly be applied to kinetic models. Results: We have implemented a simulation and training framework for Systems Biology Markup Language (SBML) models using JAX/Diffrax, which we named jaxkineticmodel. JAX provides automatic differentiation and just-in-time compilation, which speed up the parameterization of kinetic models, while also allowing for hybridizing kinetic models with neural networks. We show the robust capabilities of training kinetic models using this framework on a large collection of SBML models with different degrees of prior information on parameter initialization. We furthermore showcase the training framework implementation on a complex model of glycolysis. Finally, we show an example of hybridizing a kinetic model with a neural network when a reaction mechanism is unknown. These results show that our framework can be used to fit large metabolic kinetic models efficiently and provides a strong platform for modeling biological systems. Implementation: Implementation of jaxkineticmodel is available as a Python package at https://github.com/AbeelLab/jaxkineticmodel.
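The idea of fitting a kinetic parameter by gradient descent on a simulated trajectory can be sketched in plain Python. This is not the jaxkineticmodel API: the package relies on JAX automatic differentiation and Diffrax solvers, whereas this sketch swaps in a hand-written RK4 integrator and a finite-difference gradient so that it runs without dependencies. The single Michaelis-Menten reaction, the parameter values, and all function names are assumptions made for illustration.

```python
def simulate(vmax, km=0.5, s0=2.0, dt=0.1, n_steps=30):
    """Integrate dS/dt = -vmax * S / (km + S) with classical RK4."""
    def rate(s):
        return -vmax * s / (km + s)

    s, trajectory = s0, [s0]
    for _ in range(n_steps):
        k1 = rate(s)
        k2 = rate(s + 0.5 * dt * k1)
        k3 = rate(s + 0.5 * dt * k2)
        k4 = rate(s + dt * k3)
        s += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        trajectory.append(s)
    return trajectory


def loss(vmax, observed):
    """Mean squared error between simulated and observed substrate levels."""
    predicted = simulate(vmax)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def fit_vmax(observed, vmax=0.3, lr=0.2, n_iter=200, eps=1e-4):
    """Gradient descent on vmax; a central difference stands in for autodiff."""
    for _ in range(n_iter):
        grad = (loss(vmax + eps, observed) - loss(vmax - eps, observed)) / (2 * eps)
        vmax -= lr * grad
    return vmax


observed = simulate(1.0)      # synthetic, noise-free data with true vmax = 1.0
fitted = fit_vmax(observed)   # start from a deliberately poor guess of 0.3
print("fitted vmax:", round(fitted, 4))
```

With automatic differentiation the gradient would be exact and would scale to many parameters at once, which is the motivation the abstract gives for building on JAX.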
The Growing Strawberries Dataset
Tracking Multiple Objects with Biological Development over an Extended Period
Multiple Object Tracking (MOT) is a rapidly developing research field that targets precise and reliable tracking of objects. Unfortunately, most available MOT datasets contain only short video clips and therefore fail to capture the substantial long-term variations found in real-world scenarios. Long-term MOT poses unique challenges due to changes in both the objects and the environment, which remain relatively unexplored. To fill the gap, we propose a time-lapse image dataset inspired by the growth monitoring of strawberries, dubbed The Growing Strawberries Dataset (GSD). The data was captured hourly by six cameras, covering a span of 16 months in 2021 and 2022. During this time, it encompassed a total of 24 plants in two separate greenhouses. The changes in appearance, weight, and position during the ripening process, along with variations in the illumination during data collection, distinguish the task from previous MOT research. These practical issues caused a drastic performance drop in the track identification and association tasks of state-of-the-art MOT algorithms. We believe The Growing Strawberries Dataset will provide a platform for evaluating such long-term MOT tasks and inspire future research. The dataset is available at https://doi.org/10.4121/e3b31ece-cc88-4638-be10-8ccdd4c5f2f7.v1.
Enterococci are gut microbes of most land animals. Likely appearing first in the guts of arthropods as they moved onto land, they diversified over hundreds of millions of years adapting to evolving hosts and host diets. Over 60 enterococcal species are now known. Two species, Enterococcus faecalis and Enterococcus faecium, are common constituents of the human microbiome. They are also now leading causes of multidrug-resistant hospital-associated infection. The basis for host association of enterococcal species is unknown. To begin identifying traits that drive host association, we collected 886 enterococcal strains from widely diverse hosts, ecologies, and geographies. This identified 18 previously undescribed species expanding genus diversity by >25%. These species harbor diverse genes including toxins and systems for detoxification and resource acquisition. Enterococcus faecalis and E. faecium were isolated from diverse hosts highlighting their generalist properties. Most other species showed a more restricted distribution indicative of specialized host association. The expanded species diversity permitted the Enterococcus genus phylogeny to be viewed with unprecedented resolution, allowing features to be identified that distinguish its four deeply rooted clades, and the entry of genes associated with range expansion such as B-vitamin biosynthesis and flagellar motility to be mapped to the phylogeny. This work provides an unprecedentedly broad and deep view of the genus Enterococcus, including insights into its evolution, potential new threats to human health, and where substantial additional enterococcal diversity is likely to be found.
Nitrous oxide (N2O) is a potent greenhouse gas of primarily microbial origin. Oxic and anoxic emissions are commonly ascribed to autotrophic nitrification and heterotrophic denitrification, respectively. Beyond this established dichotomy, we quantitatively show that heterotrophic denitrification can significantly contribute to aerobic nitrogen turnover and N2O emissions in complex microbiomes exposed to frequent oxic/anoxic transitions. Two planktonic, nitrification-inhibited enrichment cultures were established under continuous organic carbon and nitrate feeding, and cyclic oxygen availability. Over a third of the influent organic substrate was respired with nitrate as electron acceptor at high oxygen concentrations (>6.5 mg/L). N2O accounted for up to one-quarter of the nitrate reduced under oxic conditions. The enriched microorganisms maintained a constitutive abundance of denitrifying enzymes due to the oxic/anoxic frequencies exceeding their protein turnover—a common scenario in natural and engineered ecosystems. The aerobic denitrification rates are ascribed primarily to the residual activity of anaerobically synthesised enzymes. From an ecological perspective, the selection of organisms capable of sustaining significant denitrifying activity during aeration shows their competitive advantage over other heterotrophs under varying oxygen availabilities. Ultimately, we propose that the contribution of heterotrophic denitrification to aerobic nitrogen turnover and N2O emissions is currently underestimated in dynamic environments.
SAFPred
Synteny-aware gene function prediction for bacteria using protein embeddings
Motivation: Today, we know the function of only a small fraction of the protein sequences predicted from genomic data. This problem is even more salient for bacteria, which represent some of the most phylogenetically and metabolically diverse taxa on Earth. This low rate of bacterial gene annotation is compounded by the fact that most function prediction algorithms have focused on eukaryotes, and conventional annotation approaches rely on the presence of similar sequences in existing databases. However, often there are no such sequences for novel bacterial proteins. Thus, we need improved gene function prediction methods tailored for bacteria. Recently, transformer-based language models, adopted from the natural language processing field, have been used to obtain new representations of proteins, to replace amino acid sequences. These representations, referred to as protein embeddings, have shown promise for improving annotation of eukaryotes, but there have been only limited applications on bacterial genomes. Results: To predict gene functions in bacteria, we developed SAFPred, a novel synteny-aware gene function prediction tool based on protein embeddings from state-of-the-art protein language models. SAFPred also leverages the unique operon structure of bacteria through conserved synteny. SAFPred outperformed both conventional sequence-based annotation methods and state-of-the-art methods on multiple bacterial species, including for distant homolog detection, where the sequence similarity to the proteins in the training set was as low as 40%. Using SAFPred to identify gene functions across diverse enterococci, of which some species are major clinical threats, we identified 11 previously unrecognized putative novel toxins, with potential significance to human and animal health.
Background: In-feed antibiotic growth promoters (AGPs) have been a cornerstone in the livestock industry due to their role in enhancing growth and feed efficiency. However, concerns over antibiotic resistance have driven a shift away from AGPs toward natural alternatives. Despite their widespread use, the exact mechanisms of AGPs and alternatives are not fully understood. This necessitates holistic studies that investigate microbiota dynamics, host responses, and the interactions between these elements in the context of AGPs and alternative feed additives. Methods: In this study, we conducted a multifaceted investigation of how Bacitracin, a common AGP, and a natural alternative impact both cecum microbiota and host expression in chickens. In addition to univariate and static differential abundance and expression analyses, we employed multivariate and time-course analyses to study this problem. To reveal host-microbe interactions, we assessed their overall correspondence and identified treatment-specific pairs of species and host expressed genes that showed significant correlations over time. Results: Our analysis revealed that factors such as developmental age substantially impacted the cecum ecosystem more than feed additives. While feed additives significantly altered microbial compositions in the later stages, they did not significantly affect overall host gene expression. The differential expression indicated that with AGP administration, host transmembrane transporters and metallopeptidase activities were upregulated around day 21. Together with the modulated kininogen binding and phenylpyruvate tautomerase activity over time, this likely contributes to the growth-promoting effects of AGPs. The difference in responses between supplementation with the AGP and with the natural alternative (PFA) suggests that these additives operate through distinct mechanisms.
Conclusion: We investigated the impact of a common AGP and its natural alternative on the poultry cecum ecosystem through an integrated analysis of both the microbiota and host responses. We found that AGP appears to enhance host nutrient utilization and modulate immune responses. The insights we gained are critical for identifying and developing effective AGP alternatives to advance sustainable livestock farming practices.
Combinatorial pathway optimization is an important tool in metabolic flux optimization. Simultaneous optimization of a large number of pathway genes often leads to combinatorial explosions. Strain optimization is therefore often performed using iterative design-build-test-learn (DBTL) cycles. The aim of these cycles is to develop a product strain iteratively, every time incorporating learning from the previous cycle. Machine learning methods provide a potentially powerful tool to learn from data and propose new designs for the next DBTL cycle. However, due to the lack of a framework for consistently testing the performance of machine learning methods over multiple DBTL cycles, evaluating the effectiveness of these methods remains a challenge. In this work, we propose a mechanistic kinetic model-based framework to test and optimize machine learning for iterative combinatorial pathway optimization. Using this framework, we show that gradient boosting and random forest models outperform the other tested methods in the low-data regime. We demonstrate that these methods are robust to training set biases and experimental noise. Finally, we introduce an algorithm for recommending new designs using machine learning model predictions. We show that when the number of strains to be built is limited, starting with a large initial DBTL cycle is favorable over building the same number of strains for every cycle.
Background: Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index’ backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. Results: We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph’s topology through visualization and sequence alignment. Conclusions: We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
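The backward search that the unidirectional FM-index provides can be sketched as follows. This is a minimal, exact-match illustration on a single string, not the Nexus data structure: it does not build a de Bruijn graph, is not bidirectional, and does not implement search schemes. The naive suffix sorting and all names are assumptions chosen for brevity.

```python
def build_fm_index(text):
    """Build a toy FM-index; '$' is a sentinel assumed absent from the text."""
    text += "$"
    # Naive suffix array construction, fine for small examples only.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # first[c]: number of characters in the text strictly smaller than c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return suffix_array, first, occ


def find(pattern, index):
    """Return sorted start positions of pattern via backward search."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of suffix array rows
    for ch in reversed(pattern):   # extend the match one character to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_fm_index("ACGACGT")
print(find("ACG", index), find("CGT", index), find("TT", index))
```

Each pattern character narrows the suffix array range using two table lookups, so the search cost depends on the pattern length rather than the text length. A bidirectional index additionally allows extending a match to the right.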
SHIP
Identifying antimicrobial resistance gene transfer between plasmids
Motivation: Plasmids are carriers for antimicrobial resistance (AMR) genes and can exchange genetic material with other structures, contributing to the spread of AMR. There is no reliable approach to identify the transfer of AMR genes across plasmids. This is mainly due to the absence of a method to assess the phylogenetic distance of plasmids, as they show large DNA sequence variability. Identifying and quantifying such transfer can provide novel insight into the role of small mobile elements and resistant plasmid regions in the spread of AMR. Results: We developed SHIP, a novel method to quantify plasmid similarity based on the dynamics of plasmid evolution. This allowed us to find conserved fragments containing AMR genes in structurally different and phylogenetically distant plasmids, which is evidence for lateral transfer. Our results show that regions carrying AMR genes are highly mobilizable between plasmids through transposons, integrons, and recombination events, and contribute to the spread of AMR. Identified transferred fragments include a multi-resistant complex class 1 integron in Escherichia coli and Klebsiella pneumoniae, and a region encoding tetracycline resistance transferred through recombination in Enterococcus faecalis.