Counterfactual Explanations and Algorithmic Recourse for Trustworthy AI
P. Altmeyer (TU Delft - Multimedia Computing)
C.C.S. Liem – Promotor (TU Delft - Multimedia Computing)
A. van Deursen – Promotor (TU Delft - Software Engineering)
Abstract
Many of the most celebrated recent advances in artificial intelligence (AI) have been built on the back of highly complex and opaque models that need little human oversight to achieve strong predictive performance. But while their capacity to recognize patterns from raw data is impressive, their decision-making process is neither robust nor well understood. This has so far inhibited trust and widespread adoption of these technologies. This thesis contributes to research efforts aimed at tackling these challenges, through interdisciplinary insights and methodological contributions.
The principal goal of this work is to contribute methods that help make opaque AI models more trustworthy. Specifically, we aim to (1) explore and challenge existing technologies and paradigms in the field; (2) improve our ability to hold opaque models accountable through thorough scrutiny; and (3) leverage the results of such scrutiny during training to improve the trustworthiness of models. Methodologically, the thesis focuses on counterfactual explanations and algorithmic recourse for individuals subjected to opaque AI systems. We explore what kind of real-world dynamics can be expected to play out when recourse is provided and implemented in practice. Based on our finding that individual cost minimization, a core objective in recourse, neglects hidden external costs of recourse itself, we revisit another established objective: namely, that explanations should first and foremost be plausible. Our work demonstrates that a narrow focus on this objective can mislead us into trusting fundamentally untrustworthy systems. To avoid this scenario, we propose a novel method that helps us uncover explanations that are maximally faithful, that is, consistent with the behavior of the model. This not only allows us to assess the trustworthiness of models, but also to improve it: we show that faithful explanations can be used during training to ensure that models learn plausible explanations.
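To give a concrete, if simplified, sense of the object of study, the sketch below illustrates a generic gradient-based counterfactual search in the spirit of Wachter et al. (2017): starting from a factual input, the features are nudged until the model's prediction flips, while a proximity penalty keeps the counterfactual close to the original input. This is illustrative only and is not the method proposed in the thesis; the toy logistic-regression model, feature values, and hyperparameters are assumptions made purely for the example.

# Illustrative sketch only: a generic gradient-based counterfactual search,
# not the specific methods contributed in this thesis. Model, data, and
# hyperparameters below are toy assumptions.
import numpy as np

# Toy logistic-regression "black box": p(y=1 | x) = sigmoid(w @ x + b)
w = np.array([1.5, -2.0])
b = -0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x):
    return sigmoid(w @ x + b)

def counterfactual(x, target=1.0, lam=0.1, lr=0.05, steps=500):
    """Search for x' near x whose predicted probability approaches `target`,
    trading off validity (prediction loss) against proximity (L2 cost)."""
    x_cf = x.copy()
    for _ in range(steps):
        p = predict_proba(x_cf)
        # Gradient of the squared prediction loss (p - target)^2 w.r.t. x_cf
        grad_pred = 2.0 * (p - target) * p * (1.0 - p) * w
        # Gradient of the proximity penalty lam * ||x_cf - x||^2
        grad_prox = 2.0 * lam * (x_cf - x)
        x_cf -= lr * (grad_pred + grad_prox)
    return x_cf

x = np.array([-1.0, 1.0])      # factual input, predicted as the negative class
x_cf = counterfactual(x)       # counterfactual pushed toward the positive class
print(predict_proba(x), predict_proba(x_cf), x_cf)

Much of the thesis concerns what such searches should optimize for beyond validity and proximity, for example plausibility of the counterfactual and faithfulness to the model's actual behavior.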
Finally, we also critically assess efforts towards trustworthy AI in the context of modern large language models (LLMs). Specifically, we cast doubt on recent findings and practices presented in the field of mechanistic interpretability and caution our fellow researchers in this space against misinterpreting and inflating their findings.
In summary, this thesis makes cutting-edge research contributions that improve our ability to make opaque AI models more trustworthy. Beyond these core research contributions, the thesis also contributes substantially to open-source software: through the various software packages we have developed, we make our research, and that of others, more accessible.