Preventing overfitting in Mixed Integer Optimization based classification tree construction

Bachelor thesis (2019)

Authors

M.B. Elgersma Electrical Engineering, Mathematics and Computer Science

Contributors

L.J.J. van Iersel (mentor), Discrete Mathematics and Optimization

Faculty

Electrical Engineering, Mathematics and Computer Science

DNA, Overfitting, Mixed Integer Optimization, Classification trees, Candida glabrata

To reference this document use:

http://resolver.tudelft.nl/uuid:542b608f-0e48-474b-b1e1-bf1096ae7fe3

Published Date

10-07-2019

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The global use of azole antifungals as treatment against infections caused by Candida has led to an increase in azole resistance. The primary goal of this BSc thesis is to improve existing Mixed Integer Optimization (MIO) models so that they classify azole resistance of C. glabrata more accurately by preventing overfitting. Moreover, these classification methods can also be used to classify any kind of numerical or categorical data. The classification method that we used is based on an MIO formulation that was first introduced by Bertsimas and Dunn, and later adapted by Van Dijk. We first made the output of the model much easier to interpret, both from a mathematical and a biological point of view. We also applied feature sampling to reduce the run time of the program, which makes it possible to create deeper trees, and to prevent overfitting. To further prevent overfitting in these deeper trees, we added to the MIO formulation the option of forcing each leaf to contain at least a certain number of training data points. We verified our MIO model on a data set constructed by combining two data sets from the Westerdijk Fungal Biodiversity Institute and a data set from the Centers for Disease Control and Prevention in Atlanta, all containing the FKS1 and FKS2 gene sequences from C. glabrata. We automated the preprocessing and merging of these data sets with a Python program, and wrote a user manual on how to use this program. By processing a bigger data set, we were able to classify more data correctly than Van Dijk, and we outperformed the CART algorithm. With feature sampling, accuracy was similar to the accuracy without it, and the run time was drastically reduced. Deeper trees did not change out-of-sample accuracy much, though this may be because our data sets did not require deeper trees.
When we also forced each leaf of these deeper trees to contain at least a certain number of training data points, we were able to slightly increase the out-of-sample accuracy, which indicates that overfitting was indeed reduced somewhat. Lastly, we interpreted the results in a biological context and found some resistance-related mutations that had already been identified in other research, as well as some additional ones whose biological relevance is as yet unknown.
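
The two overfitting controls mentioned in the abstract, feature sampling and a minimum number of training points per leaf, can be illustrated with a small sketch. This is not the MIO formulation from the thesis: it only shows the two ideas on a toy binary feature matrix, and every name in it (`sample_features`, `leaf_sizes`, `split_allowed`, `min_leaf`, the toy data) is a hypothetical chosen for illustration.

```python
import random


def sample_features(n_features, n_keep, seed=0):
    """Feature sampling: pick a random subset of feature indices.

    Restricting the model to fewer candidate features shrinks the
    search space, which is the idea behind the run-time reduction.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_features), n_keep))


def leaf_sizes(rows, feature, threshold):
    """Count the training points sent left (value < threshold) and right."""
    left = sum(1 for row in rows if row[feature] < threshold)
    return left, len(rows) - left


def split_allowed(rows, feature, threshold, min_leaf):
    """Minimum leaf size: accept a split only if both sides keep
    at least min_leaf training points."""
    left, right = leaf_sizes(rows, feature, threshold)
    return left >= min_leaf and right >= min_leaf


# Toy data (an assumption): 5 samples, 3 binary features,
# e.g. presence or absence of a mutation at three positions.
rows = [
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

candidates = sample_features(n_features=3, n_keep=2, seed=1)
allowed = [f for f in candidates if split_allowed(rows, f, 0.5, min_leaf=2)]
print("sampled features:", candidates)
print("features with an admissible split:", allowed)
```

In this toy data, a split on feature 0 would isolate a single sample in one leaf, so it is rejected when `min_leaf=2`; such single-sample leaves are the kind of overly specific rule that a minimum leaf size is meant to rule out.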

Files

BEP_repository.pdf

(.pdf | 2.09 Mb)