Classifying Candida species using Mixed Integer Optimization based optimal classification trees

Master Thesis (2019)

Author(s)

Mick van Dijk (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Leo van Iersel – Mentor (TU Delft - Discrete Mathematics and Optimization)

Prof. dr. L. Stougie – Mentor

Ir. S. Kelk – Mentor (Maastricht University)

Prof. dr. T. Boekhout – Graduation committee member

Karen Aardal – Graduation committee member (TU Delft - Discrete Mathematics and Optimization)

Faculty

Electrical Engineering, Mathematics and Computer Science

Machine Learning, Optimization, Bioinformatics

To reference this document use:

https://resolver.tudelft.nl/uuid:068ff836-099b-4abe-8d9e-cf96706169df

Publication Year

2019

Language

English

Graduation Date

28-01-2019

Awarding Institution

Delft University of Technology

Programme

Applied Mathematics

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Global medical use of azole antifungals and echinocandins has led to an enormous increase in resistant Candida species, which are most commonly associated with fungal infections. A possible mechanism of resistance is the occurrence of single or simultaneous point mutations in the genes encoding antifungal target enzymes. The aim of this thesis is to apply several classification algorithms, in particular decision tree algorithms, to Candida data sets received from the Westerdijk Fungal Biodiversity Institute and to compare them. Bertsimas and Dunn recently introduced a novel formulation based on Mixed Integer Optimization (MIO) to generate optimal classification trees. We implemented this method and applied it to C. albicans and C. glabrata data sets to construct univariate and multivariate classification trees. We were able to correctly classify 68-72% of the C. albicans isolates and 76.5-82.5% of the C. glabrata isolates. Moreover, by changing the objective function and adding constraints to the original MIO formulation, we constructed trees that take false negative errors into account, decreasing this type of error by 64-80% for C. albicans and 56-66% for C. glabrata. To deal with ambiguous nucleotides in the C. albicans data set, we introduced a novel formulation for constructing non-binary classification trees. Ternary trees turned out to represent the C. albicans data set well, performing strongly in terms of out-of-sample accuracy. Finally, we identified combinations of amino acid substitutions and nucleotide mutations possibly related to resistance in C. albicans and C. glabrata.

Files

Thesis_Mick_van_Dijk_TU_Delft_... (pdf)

(pdf | 0.562 MB)

License info not available