Classifying Candida species using Mixed Integer Optimization based optimal classification trees

Master Thesis (2019)
Author(s)

M. van Dijk (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Leo van Iersel – Mentor (TU Delft - Discrete Mathematics and Optimization)

Prof. dr. L. Stougie – Mentor

Steven Kelk – Mentor (Universiteit Maastricht)

Prof. dr. T. Boekhout – Graduation committee member

K.I. Aardal – Graduation committee member (TU Delft - Discrete Mathematics and Optimization)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2019 Mick van Dijk
Publication Year
2019
Language
English
Graduation Date
28-01-2019
Awarding Institution
Delft University of Technology
Programme
Applied Mathematics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Global medical use of azole antifungals and echinocandins has led to an enormous increase in resistant Candida species, which are most commonly associated with fungal infections. One possible mechanism of resistance is the occurrence of single or simultaneous point mutations in the genes that encode antifungal target enzymes. The aim of this thesis is to apply several classification algorithms, in particular decision tree algorithms, to Candida data sets received from the Westerdijk Fungal Biodiversity Institute, and to compare their results. Bertsimas and Dunn recently introduced a novel formulation based on Mixed Integer Optimization (MIO) to generate optimal classification trees. We implemented this method and applied it to C. albicans and C. glabrata data sets to construct univariate and multivariate classification trees. We correctly classified 68-72% of the C. albicans isolates and 76.5-82.5% of the C. glabrata isolates. Moreover, by changing the objective function and adding constraints to the original MIO formulation, we constructed trees that take false negative errors into account, decreasing this type of error by 64-80% for C. albicans and 56-66% for C. glabrata. To deal with ambiguous nucleotides in the C. albicans data set, we introduced a novel formulation to construct non-binary classification trees. Ternary trees turned out to represent the C. albicans data set well, performing strongly in terms of out-of-sample accuracy. Finally, we identified combinations of amino acid substitutions and nucleotide mutations possibly related to resistance in C. albicans and C. glabrata.
