Investigating Unfaithful Behavior in Neural Rationale Models

Master Thesis (2025)

Author(s)

L.E. Holvoet (TU Delft - Mechanical Engineering)

Contributor(s)

J.C.F. de Winter – Mentor (TU Delft - Human-Robot Interaction)

A. Anand – Mentor (TU Delft - Web Information Systems)

Y.B. Eisma – Graduation committee member (TU Delft - Human-Robot Interaction)

L.J.L. Leonhardt – Mentor (TU Delft - Web Information Systems)

Faculty

Mechanical Engineering

NLP, Interpretability, Rationales, Faithfulness, Select-then-Predict

To reference this document use:

https://resolver.tudelft.nl/uuid:7456975b-9067-4af7-bddf-5aad664590bc

Publication Year

2025

Language

English

Graduation Date

06-05-2025

Awarding Institution

Delft University of Technology

Programme

Mechanical Engineering | Vehicle Engineering | Cognitive Robotics

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Numerous techniques have been developed to explain the reasoning process of black-box models. Among them is a class of models designed to be inherently interpretable: select-then-predict models (also known as rationale-based models). These models are meant to explain their predictions by highlighting part of the input as evidence. This evidence, called the rationale, should consist of the parts of the input text that contribute most to the model's decision.
However, according to some recent studies, these models are not truly interpretable, because they do not provide faithful explanations (i.e., explanations that accurately reflect the true reasoning process of the model).
In this thesis, we give a formal definition of the degree of unfaithfulness in order to quantify unfaithful behavior. We then introduce an experiment to test the faithfulness of select-then-predict models and demonstrate that these models can provide unfaithful rationales. Lastly, we introduce a loss function, the unfaithfulness loss, which minimizes the degree of unfaithfulness of select-then-predict models and teaches them to produce more faithful and plausible rationales.
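
For readers unfamiliar with the select-then-predict setup, the following is a minimal toy sketch of the general idea only: a selector marks a subset of tokens as the rationale, and a predictor sees nothing but that subset. It is not the thesis's architecture, its definition of the degree of unfaithfulness, or its unfaithfulness loss. The lexicon `TOY_SCORES` and the functions `select`, `predict`, and `select_then_predict` are hypothetical names introduced purely for illustration; real models learn both components with neural networks.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon standing in for learned parameters (an assumption,
# not taken from the thesis).
TOY_SCORES: Dict[str, float] = {
    "great": 2.0,
    "good": 1.0,
    "bad": -1.0,
    "awful": -2.0,
}


def select(tokens: List[str], k: int) -> List[int]:
    """Selector: return a binary mask marking the k tokens with the
    largest absolute toy score (ties keep the earlier token)."""
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: abs(TOY_SCORES.get(tokens[i], 0.0)),
        reverse=True,
    )
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(tokens))]


def predict(tokens: List[str], mask: List[int]) -> str:
    """Predictor: classify using only the tokens kept by the mask."""
    total = sum(TOY_SCORES.get(t, 0.0) for t, m in zip(tokens, mask) if m)
    return "positive" if total >= 0 else "negative"


def select_then_predict(text: str, k: int = 2) -> Tuple[str, List[str]]:
    """Run the selector, then the predictor on the selected tokens only.
    Returns the predicted label and the rationale (selected tokens)."""
    tokens = text.lower().split()
    mask = select(tokens, k)
    rationale = [t for t, m in zip(tokens, mask) if m]
    return predict(tokens, mask), rationale


if __name__ == "__main__":
    label, rationale = select_then_predict(
        "The food was awful and the service was bad", k=2
    )
    print(label, rationale)
```

In this toy version the predictor cannot use anything outside the rationale by construction. The abstract's point is that, for trained neural models, highlighting a rationale does not by itself guarantee that the explanation is faithful.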

Files

Thesis_Laura_Holvoet_final.pdf

(pdf | 4.39 Mb)

License info not available