Are Neural Networks Robust to Gradient-Based Adversaries Also More Explainable? Evidence from Counterfactuals

Bachelor Thesis (2024)
Author(s)

R. Appachi Senthilkumar (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P. Altmeyer – Mentor (TU Delft - Multimedia Computing)

Cynthia C.S. Liem – Mentor (TU Delft - Multimedia Computing)

B.J.W. Dudzik – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
27-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Adversarial training has emerged as the most reliable technique for making neural networks robust to gradient-based adversarial perturbations of input data. Besides improving model robustness, preliminary evidence suggests an interesting consequence of adversarial training: increased explainability of model behaviour. Prior work has explored the effects of adversarial training on gradient stability and interpretability, as well as the visual explainability of counterfactuals. Our work presents the first quantitative, empirical analysis of the impact of model robustness on model explainability, comparing the plausibility of faithful counterfactuals for robust and standard networks. We seek to determine whether robust networks learn more plausible decision boundaries and data representations than regular models, and whether the strength of the adversary used to train robust models affects their explainability. Our findings indicate that robust networks for image data learn more explainable decision boundaries and data representations than regular models, with more robust models producing more plausible counterfactuals. Robust models for tabular data, however, only conclusively exhibit this phenomenon along decision boundaries and not in their overall data representations, possibly owing to the high robustness-accuracy trade-off on tabular data and the difficulties that its innate properties pose for traditional adversarial training. We believe our work can help guide future research towards improving the robustness of machine learning models while keeping their explainability in mind.
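
For context, the sketch below illustrates PGD-based adversarial training, the standard gradient-based defence referred to in the abstract. It is a minimal illustration, not the thesis's exact setup: the classifier `model`, optimizer, perturbation budget `eps`, step size `alpha`, number of steps, and the assumption that inputs are normalised to [0, 1] are all illustrative choices.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: find a perturbation delta with ||delta||_inf <= eps
    that (approximately) maximizes the classification loss."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascend along the gradient sign, then project back onto the eps-ball.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Keep the adversarial example inside the valid input range (assumed [0, 1]).
    return (x + delta).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization: train on worst-case perturbed inputs instead of
    clean ones (the min-max objective of adversarial training)."""
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The strength of the adversary in this sketch is controlled by `eps` and the number of PGD steps, which corresponds to the notion of adversary strength whose effect on explainability the thesis investigates.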

Files

Rithik_Bachelor_Thesis.pdf
(pdf | 0.553 MB)
License info not available