Investigating Unfaithful Behavior in Neural Rationale Models

Master Thesis (2025)
Author(s)

L.E. Holvoet (TU Delft - Mechanical Engineering)

Contributor(s)

J.C.F. de Winter – Mentor (TU Delft - Human-Robot Interaction)

A. Anand – Mentor (TU Delft - Web Information Systems)

Y.B. Eisma – Graduation committee member (TU Delft - Human-Robot Interaction)

L.J.L. Leonhardt – Mentor (TU Delft - Web Information Systems)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Graduation Date
06-05-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Numerous techniques have been developed to explain the reasoning process of black-box models. Among them is a class of models designed to be inherently interpretable: select-then-predict models (also known as rationale-based models). These models are meant to explain their predictions by highlighting part of the input as evidence. This evidence, called the rationale, should consist of the parts of the input text that contribute most to the model's decision.
However, according to some recent studies, these models are not truly interpretable, because they do not provide faithful explanations (i.e., explanations that accurately reflect the true reasoning process of the model).
In this thesis, we give a formal definition of the degree of unfaithfulness in order to quantify unfaithful behavior. We then introduce an experiment that tests the faithfulness of select-then-predict models and show that these models can provide unfaithful rationales. Lastly, we introduce a loss function, called the unfaithfulness loss, that minimizes the degree of unfaithfulness of select-then-predict models and teaches them to produce more faithful and plausible rationales.
