Leveraging Feature Extraction to Detect Adversarial Examples

Let's Meet in the Middle

Master Thesis (2024)
Author(s)

R. Stenhuis (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Kaitai Liang – Mentor (TU Delft - Cyber Security)

Dazhuang Liu – Graduation committee member (TU Delft - Cyber Security)

S.E. Verwer – Graduation committee member (TU Delft - Algorithmics)

Jeremie Decouchant – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
03-10-2024
Awarding Institution
Delft University of Technology
Programme
Computer Science | Cyber Security
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Previous research has explored the detection of adversarial examples with dimensionality reduction and Out-of-Distribution (OOD) recognition. However, these approaches are not effective against white-box adversarial attacks. Moreover, recent OOD methods that rely on hidden units scale poorly with the size of the target model.

For that reason, we study various explanations of adversarial examples to better understand their properties and anomalies. Furthermore, we discuss the added value of using natural scene statistics and utility functions to improve the relevance of the features used for detection. By exploiting the anomalies we identify for adversarial examples in an ensemble, this thesis is the first to propose a robust detection solution against adaptive and white-box attacks.
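As a rough illustration of the natural-scene-statistics idea (the abstract does not specify the exact feature set), Mean Subtracted Contrast Normalized (MSCN) coefficients of the kind used in BRISQUE are a common NSS feature whose distribution tends to shift under perturbations. The sketch below is a hypothetical example, not the thesis' implementation; the smoothing scale and stabilising constant are illustrative defaults.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(image, sigma=7 / 6, c=1e-6):
    """Mean Subtracted Contrast Normalized (MSCN) coefficients.

    A common natural-scene-statistics feature (BRISQUE-style); sigma and c
    are illustrative defaults, not values taken from the thesis.
    """
    image = image.astype(np.float64)
    mu = gaussian_filter(image, sigma)                      # local mean
    var = gaussian_filter(image * image, sigma) - mu * mu   # local variance
    sigma_local = np.sqrt(np.clip(var, 0.0, None))          # local std, clipped for stability
    return (image - mu) / (sigma_local + c)
```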

Specifically, we address these challenges with MeetSafe, a Gaussian Mixture Model that leverages principal component analysis, feature squeezing, and density estimation to detect adaptive white-box adversaries. In addition, our enhanced Local Reachability Density (LRD) algorithm improves the efficiency of state-of-the-art OOD methods: it enhances scalability by feature bagging over the hidden units with large absolute Z-scores. We then show that individual predictors, including LRD, are far more effective within ensembles such as MeetSafe, which supports prior conjectures that combining a range of different heuristics further constrains adversaries.
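The following sketch illustrates the kind of pipeline described above, assuming access to hidden-layer features of the target model. The class names, component counts, percentile threshold, bit depth, and Z-score selection rule are illustrative assumptions rather than MeetSafe's exact design.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def squeeze_bit_depth(x, bits=4):
    """Illustrative feature squeezing: reduce inputs in [0, 1] to `bits`-bit depth."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels

def select_units_by_zscore(activations, k=64):
    """Hypothetical feature-bagging step: keep the k hidden units whose
    activations have the largest mean absolute Z-score over a reference set."""
    mu = activations.mean(axis=0)
    sd = activations.std(axis=0) + 1e-12
    z = np.abs((activations - mu) / sd).mean(axis=0)
    return np.argsort(z)[-k:]

class GMMDetector:
    """PCA + Gaussian Mixture density model fitted on features of clean inputs;
    low-density test inputs are flagged as potentially adversarial."""

    def __init__(self, n_components=10, n_pca=32, percentile=5.0):
        self.pca = PCA(n_components=n_pca)
        self.gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        self.percentile = percentile
        self.threshold = None

    def fit(self, clean_features):
        # Reduce dimensionality, then model the density of benign features.
        z = self.pca.fit_transform(clean_features)
        self.gmm.fit(z)
        scores = self.gmm.score_samples(z)                  # log-likelihood per sample
        self.threshold = np.percentile(scores, self.percentile)
        return self

    def is_adversarial(self, features):
        # Flag inputs whose features fall in the low-density tail of the fit.
        z = self.pca.transform(features)
        return self.gmm.score_samples(z) < self.threshold
```

In a MeetSafe-style ensemble, several such scores (for example, the density of squeezed versus unsqueezed features, or LRD computed over the Z-score-selected units) would be combined, which is what makes the combined detector harder to evade than any single heuristic.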

Extensive experiments on 14 models show that MeetSafe detects adaptive perturbations with an accuracy of 62% on STL-10, 75% on CIFAR-10, and 99% on MNIST using either adversarial training or Reverse Cross Entropy (RCE), an improvement of at least 8.1% over each evaluated method when averaged across the three datasets.
