Creating phylogenetic networks from clusters

Master thesis (2023)

Authors

J. Keur Electrical Engineering, Mathematics and Computer Science

Contributors

L.J.J. van Iersel Discrete Mathematics and Optimization - (mentor)

Cornelis Vuik Delft Institute of Applied Mathematics (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science, Electrical Engineering, Mathematics and Computer Science

More Info

expand_more

To reference this document use:

http://resolver.tudelft.nl/uuid:4dd0bbf1-a734-4e72-af8f-de3c3651c760

Published Date

22-02-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Electrical Engineering, Mathematics and Computer Science

Abstract

In biology, phylogenetics is the study of the evolutionary history of and relations between e.g. species. Such data are often represented in trees. Remarkably, trees lack the representation of reticulation events, such as hybridization, while such events are believed to be important. One of the reasons why trees are widely used, is the enormous complexity of inferring a network directly from DNA data.
Therefore, indirect approaches are studied. One option is to construct a network from trees, which have been constructed from DNA data. Another option, which is studied in this work, is to create a network which represents all clusters from the trees. We are interested in finding a network having the lowest level possible, i.e. the lowest number of reticulations per biconnected component. The Cass algorithm is one of the best algorithms in this respect. However, it does not always give optimal results. In this thesis, an optimal algorithm is presented called STCass, of which many ideas are inspired from Cass. We lay
theoretical foundations on how to build an optimal network. Furthermore, (greedy) steps are proposed to solve the problem, which seem to work very well in practice. Based on several optimizations that are proposed, an improved algorithm called MSTCass is presented, which runs faster than STCass, in general. In order to speed up any Cass-based algorithm or derivatives of it, several lower bounds are presented on the reticulation number r(N) per biconnected component N of an optimal network. Besides, a new upper bound on r(N) is conjectured which can be evaluated in seconds, although e.g. an upper bound obtained by the HybridInterleave or PIRN algorithm is mostly lower and which can be evaluated often in twenty minutes for the tested instances.

Files

Thesis_Keur.pdf

(.pdf | 4.8 Mb)