Preventing overfitting in Mixed Integer Optimization based classification tree construction

Abstract

The global use of azole antifungals to treat infections caused by Candida has led to an increase in azole resistance. The primary goal of this BSc thesis is to improve existing Mixed Integer Optimization (MIO) models so that they classify azole resistance of C. glabrata more accurately by preventing overfitting. These classification methods can also be applied more generally to any kind of numerical or categorical data. The classification method that we used is based on an MIO formulation that was first introduced by Bertsimas and Dunn and later adapted by Van Dijk. We first made the output of the model much easier to interpret, both from a mathematical and a biological point of view. We also applied feature sampling to reduce the run time of the program, which makes it possible to create deeper trees, and to prevent overfitting. To further prevent overfitting in these deeper trees, we added to the MIO formulation the option of requiring at least a certain number of training data points in each leaf. We verified our MIO model on a data set constructed by combining two data sets from the Westerdijk Fungal Biodiversity Institute and a data set from the Centers for Disease Control and Prevention in Atlanta, all containing the FKS1 and FKS2 gene sequences from C. glabrata. We automated the preprocessing and merging of these data sets with a Python program, and wrote a user manual for this program. By processing a bigger data set we were able to classify more data correctly than Van Dijk, and we outperformed the CART algorithm. Applying feature sampling gave accuracy similar to that obtained without it, while drastically reducing the run time. Deeper trees did not change out-of-sample accuracy much, though this may be because our data sets did not require deeper trees. When we also required at least a certain number of training data points in each leaf of these deeper trees, the out-of-sample accuracy increased slightly, which indicates that overfitting was indeed slightly reduced. Lastly, we interpreted the results in a biological context and found some resistance-related mutations that had already been identified in other research, as well as some additional ones whose biological relevance is as yet unknown.

Files