Classifying Candida species using Mixed Integer Optimization based optimal classification trees

Abstract

Global medical use of azole antifungals and echinocandins has led to an enormous increase in resistant Candida species, which are most commonly associated with fungal infections. One possible mechanism of resistance is single or simultaneous point mutations in the genes that encode antifungal target enzymes. The aim of this thesis is to apply and compare several classification algorithms, in particular decision tree algorithms, to Candida data sets provided by the Westerdijk Fungal Biodiversity Institute. Bertsimas and Dunn recently introduced a novel formulation based on Mixed Integer Optimization (MIO) to generate optimal classification trees. We implemented this method and applied it to C. albicans and C. glabrata data sets to construct univariate and multivariate classification trees. We were able to correctly classify 68-72% of the C. albicans isolates and 76.5-82.5% of the C. glabrata isolates. Moreover, by changing the objective function and adding constraints to the original MIO formulation, we constructed trees that account for false negative errors, decreasing this type of error by 64-80% for C. albicans and 56-66% for C. glabrata. To deal with ambiguous nucleotides in the C. albicans data set, we introduced a novel formulation to construct non-binary classification trees. Ternary trees turned out to represent the C. albicans data set well, performing strongly in terms of out-of-sample accuracy. Finally, we identified combinations of amino acid substitutions and nucleotide mutations possibly related to resistance in C. albicans and C. glabrata.