Improving the robustness of decision trees in security-sensitive settings

Master Thesis (2020)
Author(s)

S.J.M. Buijs (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

S.E. Verwer – Mentor (TU Delft - Cyber Security)

Reginald L. Lagendijk – Graduation committee member (TU Delft - Cyber Security)

David M.J. Tax – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Publication Year
2020
Language
English
Copyright
© 2020 Cas Buijs
Graduation Date
13-08-2020
Awarding Institution
Delft University of Technology
Programme
Computer Science | Cyber Security
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Machine learning is used for security purposes, to distinguish the benign from the malicious. While decision trees can lead to understandable and explainable classifications, an adversary could manipulate the model input to evade detection, e.g. malicious input being classified as benign. State-of-the-art techniques improve robustness by taking these adversarial attacks into account when building the model. In this work, I identify three factors contributing to the robustness of a decision tree: feature frequency, the shortest distance between malicious leaves and the benign prediction space, and the impurity of the benign prediction space. I propose two splitting criteria to improve these factors and suggest combining them, via two trade-off approaches, with a common splitting criterion, Gini impurity, in order to balance accuracy and robustness. These combinations allow building models that are more robust against adversaries manipulating the malicious data, without taking adversarial attacks into account during training. The approaches are evaluated in a white-box setting against a decision tree and a random forest, considering an unbounded adversary, where robustness is measured using an L1-distance norm and the false negative rate. All combinations lead to more robust models at different costs in terms of accuracy, showing that adversarial attacks do not need to be taken into account to improve robustness. Compared to state-of-the-art work, the best approach achieves on average 3.17% better accuracy at the cost of on average 5.5% lower robustness on the used datasets for a single decision tree. In a random forest, the best approach achieves on average 2.87% better robustness and 2.37% better accuracy on the used datasets compared to the state-of-the-art work. The state-of-the-art work does not seem to affect all of the identified factors, which leaves room for models even more robust than those that currently exist.
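The abstract leaves the exact form of the proposed criteria open, so the following Python sketch illustrates one plausible reading of the trade-off idea: a node-splitting score that linearly blends Gini impurity with a robustness term. The `robustness_penalty` function, the `alpha` weight, and the linear combination itself are assumptions made for illustration; they are not the thesis's actual splitting criteria.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of binary class labels (0 = benign, 1 = malicious)."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2.0 * p * (1.0 - p)

def robustness_penalty(left_labels, right_labels):
    """Hypothetical robustness term (assumption, not from the thesis):
    the fraction of malicious samples remaining in the purer ("benign")
    side of the split, i.e. the impurity of the benign prediction space
    that an evading adversary could exploit."""
    left_mal = np.mean(left_labels) if len(left_labels) else 1.0
    right_mal = np.mean(right_labels) if len(right_labels) else 1.0
    return min(left_mal, right_mal)

def split_score(left_labels, right_labels, alpha=0.5):
    """Weighted trade-off between Gini impurity and the robustness term.
    alpha = 1.0 recovers the ordinary Gini splitting criterion."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    gini = (n_left * gini_impurity(left_labels)
            + n_right * gini_impurity(right_labels)) / n
    return alpha * gini + (1.0 - alpha) * robustness_penalty(left_labels, right_labels)
```

Choosing the split that minimizes `split_score` at each node, and sweeping `alpha` from 1.0 (pure Gini) down toward 0.0, would trace an accuracy-robustness trade-off of the kind the abstract describes.

The L1-based robustness measure can likewise be read as the minimal L1 perturbation an unbounded adversary needs to move a malicious sample into the benign prediction space. A minimal sketch, assuming the benign leaves of the tree are available as axis-aligned boxes with one `(low, high)` interval per feature (this representation is an assumption for illustration):

```python
def min_l1_evasion(x, benign_boxes):
    """Minimal L1 perturbation that moves sample x into any benign leaf.
    The cheapest way to enter an axis-aligned box is to clip each feature
    to its interval, so the cost is the summed distance to each interval."""
    best = float("inf")
    for box in benign_boxes:
        cost = sum(max(lo - xi, 0.0) + max(xi - hi, 0.0)
                   for xi, (lo, hi) in zip(x, box))
        best = min(best, cost)
    return best
```

Averaging `min_l1_evasion` over all malicious test samples yields a single robustness score; larger values mean the adversary must change the input more to evade detection.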

Files

Thesis_Cas.pdf
(PDF | 7.92 MB)
License info not available