Leen Stougie | TU Delft Repository

Inferring phylogenetic networks from multifurcating trees via cherry picking and machine learning

Journal article (2024) - Giulia Bernardini, Leo van Iersel, Esther Julien, Leen Stougie

The Hybridization problem asks to reconcile a set of conflicting phylogenetic trees into a single phylogenetic network with the smallest possible number of reticulation nodes. This problem is computationally hard and previous solutions are limited to small and/or severely restricted data sets, for example, a set of binary trees with the same taxon set or only two non-binary trees with non-equal taxon sets. Building on our previous work on binary trees, we present FHyNCH, the first algorithmic framework to heuristically solve the Hybridization problem for large sets of multifurcating trees whose sets of taxa may differ. Our heuristics combine the cherry-picking technique, recently proposed to solve the same problem for binary trees, with two carefully designed machine-learning models. We demonstrate that our methods are practical and produce qualitatively good solutions through experiments on both synthetic and real data sets. ...

Constructing phylogenetic networks via cherry picking and machine learning

Journal article (2023) - Giulia Bernardini, Leo van Iersel, Esther Julien, Leen Stougie

Background: Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. Existing methods are computationally expensive and can either handle only small numbers of phylogenetic trees or are limited to severely restricted classes of networks. Results: In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of efficient heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets consisting of binary trees. Some of the heuristics in this framework are based on the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. We also propose simple and fast randomised heuristics that prove to be very effective when run multiple times. Conclusions: Unlike the existing exact methods, our heuristics are applicable to datasets of practical size, and the experimental study we conducted on both simulated and real data shows that these solutions are qualitatively good, always within some small constant factor from the optimum. Moreover, our machine-learned heuristics are one of the first applications of machine learning to phylogenetics and show its promise. ...

Reconstructing Phylogenetic Networks via Cherry Picking and Machine Learning

Conference paper (2022) - Giulia Bernardini, Leo van Iersel, Esther Julien, Leen Stougie

Combining a set of phylogenetic trees into a single phylogenetic network that explains all of them is a fundamental challenge in evolutionary studies. In this paper, we apply the recently-introduced theoretical framework of cherry picking to design a class of heuristics that are guaranteed to produce a network containing each of the input trees, for practical-size datasets. The main contribution of this paper is the design and training of a machine learning model that captures essential information on the structure of the input trees and guides the algorithms towards better solutions. This is one of the first applications of machine learning to phylogenetic studies, and we show its promise with a proof-of-concept experimental study conducted on both simulated and real data consisting of binary trees with no missing taxa. ...

Maximum parsimony distance on phylogenetic trees

A linear kernel and constant factor approximation algorithm

Journal article (2021) - Mark Jones, Steven Kelk, Leen Stougie

Maximum parsimony distance is a measure used to quantify the dissimilarity of two unrooted phylogenetic trees. It is NP-hard to compute, and very few positive algorithmic results are known due to its complex combinatorial structure. Here we address this shortcoming by showing that the problem is fixed parameter tractable. We do this by establishing a linear kernel i.e., that after applying certain reduction rules the resulting instance has size that is bounded by a linear function of the distance. As powerful corollaries to this result we prove that the problem permits a polynomial-time constant-factor approximation algorithm; that the treewidth of a natural auxiliary graph structure encountered in phylogenetics is bounded by a function of the distance; and that the distance is within a constant factor of the size of a maximum agreement forest of the two trees, a well studied object in phylogenetics. ...

Applicability of several rooted phylogenetic network algorithms for representing the evolutionary history of SARS-CoV-2

Journal article (2021) - Rosanne Wallin, Leo van Iersel, Steven Kelk, Leen Stougie

Background: Rooted phylogenetic networks are used to display complex evolutionary history involving so-called reticulation events, such as genetic recombination. Various methods have been developed to construct such networks, using for example a multiple sequence alignment or multiple phylogenetic trees as input data. Coronaviruses are known to recombine frequently, but rooted phylogenetic networks have not yet been used extensively to describe their evolutionary history. Here, we created a workflow to compare the evolutionary history of SARS-CoV-2 with other SARS-like viruses using several rooted phylogenetic network inference algorithms. This workflow includes filtering noise from sets of phylogenetic trees by contracting edges based on branch length and bootstrap support, followed by resolution of multifurcations. We explored the running times of the network inference algorithms, the impact of filtering on the properties of the produced networks, and attempted to derive biological insights regarding the evolution of SARS-CoV-2 from them. Results: The network inference algorithms are capable of constructing rooted phylogenetic networks for coronavirus data, although running-time limitations require restricting such datasets to a relatively small number of taxa. Filtering generally reduces the number of reticulations in the produced networks and increases their temporal consistency. Taxon bat-SL-CoVZC45 emerges as a major and structural source of discordance in the dataset. The tested algorithms often indicate that SARS-CoV-2/RaTG13 is a tree-like clade, with possibly some reticulate activity further back in their history. A smaller number of constructed networks posit SARS-CoV-2 as a possible recombinant, although this might be a methodological artefact arising from the interaction of bat-SL-CoVZC45 discordance and the optimization criteria used. Conclusion: Our results demonstrate that as part of a wider workflow and with careful attention paid to running time, rooted phylogenetic network algorithms are capable of producing plausible networks from coronavirus data. These networks partly corroborate existing theories about SARS-CoV-2, and partly produce new avenues for exploration regarding the location and significance of reticulate activity within the wider group of SARS-like viruses. Our workflow may serve as a model for pipelines in which phylogenetic network algorithms can be used to analyse different datasets and test different hypotheses. ...

Background: Rooted phylogenetic networks are used to display complex evolutionary history involving so-called reticulation events, such as genetic recombination. Various methods have been developed to construct such networks, using for example a multiple sequence alignment or multiple phylogenetic trees as input data. Coronaviruses are known to recombine frequently, but rooted phylogenetic networks have not yet been used extensively to describe their evolutionary history. Here, we created a workflow to compare the evolutionary history of SARS-CoV-2 with other SARS-like viruses using several rooted phylogenetic network inference algorithms. This workflow includes filtering noise from sets of phylogenetic trees by contracting edges based on branch length and bootstrap support, followed by resolution of multifurcations. We explored the running times of the network inference algorithms, the impact of filtering on the properties of the produced networks, and attempted to derive biological insights regarding the evolution of SARS-CoV-2 from them. Results: The network inference algorithms are capable of constructing rooted phylogenetic networks for coronavirus data, although running-time limitations require restricting such datasets to a relatively small number of taxa. Filtering generally reduces the number of reticulations in the produced networks and increases their temporal consistency. Taxon bat-SL-CoVZC45 emerges as a major and structural source of discordance in the dataset. The tested algorithms often indicate that SARS-CoV-2/RaTG13 is a tree-like clade, with possibly some reticulate activity further back in their history. A smaller number of constructed networks posit SARS-CoV-2 as a possible recombinant, although this might be a methodological artefact arising from the interaction of bat-SL-CoVZC45 discordance and the optimization criteria used. Conclusion: Our results demonstrate that as part of a wider workflow and with careful attention paid to running time, rooted phylogenetic network algorithms are capable of producing plausible networks from coronavirus data. These networks partly corroborate existing theories about SARS-CoV-2, and partly produce new avenues for exploration regarding the location and significance of reticulate activity within the wider group of SARS-like viruses. Our workflow may serve as a model for pipelines in which phylogenetic network algorithms can be used to analyse different datasets and test different hypotheses.

BacHBerry

BACterial Hosts for production of Bioactive phenolics from bERRY fruits

Journal article (2018) - Alexey Dudnik, A. Filipa Almeida, Dario Breitel, Rex Brennan, Laurent Bulteau, Celine Chanforan, Inês Costa, Rafael S. Costa, Mahdi Doostmohammadi, Nuno Faria, Chengyong Feng, Armando Fernandes, Ricardo Andrade, Patricia Ferreira, Roberto Ferro, Alexandre Foito, Sabine Freitag, Gonçalo Garcia, Paula Gaspar, Joana Godinho-Pereira, Björn Hamberger, András Hartmann, Harald Heider, Barbara Avila, Carolina Jardim, Alice Julien-Laferriere, Nicolai Kallscheuer, Wolfgang Kerbe, Oscar P. Kuipers, Shanshan Li, Nicola Love, Alberto Marchetti-Spaccamela, Jan Marienhagen, Cathie Martin, Pilar Bañados, Arnaud Mary, Vincent Mazurek, Camillo Meinhart, David Méndez Sevillano, Regina Menezes, Michael Naesby, Morten H.H. Nørholm, Finn T. Okkels, Joana Oliveira, Marcel Ottens, Diane Barbay, Delphine Parrot, Lei Pei, Isabel Rocha, Rita Rosado-Ramos, Caroline Rousseau, Marie France Sagot, Claudia Nunes Dos Santos, Markus Schmidt, Tatiana Shelenga, Louise Shepherd, Jean Etienne Bassard, Ana Rita Silva, Marcelo Henriques da Silva, Olivier Simon, Steen Gustav Stahlhut, Ana Solopova, Artem Sorokin, Derek Stewart, Leen Stougie, Shang Su, Vera Thole, Mounir Benkoulouche, Olga Tikhonova, Martin Trick, Philippe Vain, André Veríssimo, Ana Vila-Santa, Susana Vinga, Michael Vogt, Liangsheng Wang, Lijin Wang, Wei Wei, Michael Bott, Sandra Youssef, Ana Rute Neves, Jochen Forster, Adelaide Braga

BACterial Hosts for production of Bioactive phenolics from bERRY fruits (BacHBerry) was a 3-year project funded by the Seventh Framework Programme (FP7) of the European Union that ran between November 2013 and October 2016. The overall aim of the project was to establish a sustainable and economically-feasible strategy for the production of novel high-value phenolic compounds isolated from berry fruits using bacterial platforms. The project aimed at covering all stages of the discovery and pre-commercialization process, including berry collection, screening and characterization of their bioactive components, identification and functional characterization of the corresponding biosynthetic pathways, and construction of Gram-positive bacterial cell factories producing phenolic compounds. Further activities included optimization of polyphenol extraction methods from bacterial cultures, scale-up of production by fermentation up to pilot scale, as well as societal and economic analyses of the processes. This review article summarizes some of the key findings obtained throughout the duration of the project. ...

On Unrooted and Root-Uncertain Variants of Several Well-Known Phylogenetic Network Problems

Journal article (2017) - Leo van Iersel, Steven Kelk, Georgios Stamoulis, Leen Stougie, Olivier Boes

The hybridization number problem requires us to embed a set of binary rooted phylogenetic trees into a binary rooted phylogenetic network such that the number of nodes with indegree two is minimized. However, from a biological point of view accurately inferring the root location in a phylogenetic tree is notoriously difficult and poor root placement can artificially inflate the hybridization number. To this end we study a number of relaxed variants of this problem. We start by showing that the fundamental problem of determining whether an unrooted phylogenetic network displays (i.e. embeds) an unrooted phylogenetic tree, is NP-hard. On the positive side we show that this problem is FPT in reticulation number. In the rooted case the corresponding FPT result is trivial, but here we require more subtle argumentation. Next we show that the hybridization number problem for unrooted networks (when given two unrooted trees) is equivalent to the problem of computing the tree bisection and reconnect distance of the two unrooted trees. In the third part of the paper we consider the “root uncertain” variant of hybridization number. Here we are free to choose the root location in each of a set of unrooted input trees such that the hybridization number of the resulting rooted trees is minimized. On the negative side we show that this problem is APX-hard. On the positive side, we show that the problem is FPT in the hybridization number, via kernelization, for any number of input trees. ...

WhatsHap

Weighted haplotype assembly for future-generation sequencing reads

Journal article (2015) - Murray Patterson, Tobias Marschall, Nadia Pisanti, Leo Van Iersel, Leen Stougie, Gunnar W Klau, Alexander Schönhuth

Networks: expanding evolutionary thinking

Journal article (2013) - Eric Bapteste, Leo van Iersel, Axel Janke, Scot Kelchner, Steven Kelk, James O McInerney, David A Morrison, Luay Nakhleh, Mike Steel, Leen Stougie

Cycle killer... qu'est-ce que c'est? On the comparative approximability of hybridization number and directed feedback vertex set

Journal article (2012) - Steven Kelk, Leo Van Iersel, Nela Lekic, Simone Linz, Celine Scornavacca, Leen Stougie

We show that the problem of computing the hybridization number of two rooted binary phylogenetic trees on the same set of taxa $X$ has a constant factor polynomial-time approximation if and only if the problem of computing a minimum-size feedback vertex set in a directed graph (DFVS) has a constant factor polynomial-time approximation. The latter problem, which asks for a minimum number of vertices to be removed from a directed graph to transform it into a directed acyclic graph, is one of the problems in Karp's seminal 1972 list of 21 NP-complete problems. Despite considerable attention from the combinatorial optimization community, it remains to this day unknown whether a constant factor polynomial-time approximation exists for DFVS. Our result thus places the (in)approximability of hybridization number in a much broader complexity context, and as a consequence we obtain that it inherits inapproximability results from the problem Vertex Cover. On the positive side, we use results from the DFVS literature to give an $\text{O}( \log r \log \log r)$ approximation for the hybridization number where $r$ is the correct value. ...

Constructing level-2 phylogenetic networks from triplets

Journal article (2009) - Leo Van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, Ferry Hagen, Teun Boekhout

Prefix reversals on binary and ternary strings

Journal article (2007) - Cor Hurkens, Leo Van Iersel, Judith Keijsper, Steven Kelk, Leen Stougie, John Tromp

A decade of combinatorial optimization

Book chapter (1997) - Karen Aardal, C.P.M. van Hoesel, Jan Karel Lenstra, Leen Stougie