Evaluating Faithfulness of LLM Generated Explanations for Claims: Are Current Metrics Effective?

Analysing the Capabilities of Evaluation Metrics to Represent the Difference Between Generated and Expert-written Explanations

Bachelor Thesis (2025)
Author(s)

B. Marinov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Pradeep Murukannaiah – Mentor (TU Delft - Interactive Intelligence)

S. Mukherjee – Mentor (TU Delft - Interactive Intelligence)

X. Zhang – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
26-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) are increasingly used to generate fact-checking explanations, but evaluating how faithful these justifications are remains a major challenge. In this paper, we examine how well four popular automatic metrics (G-Eval, UniEval, FactCC, and QAGS) capture faithfulness compared to expert-written explanations. We assess how these metrics agree with each other, how they correlate with explanation similarity, and how they respond to controlled errors. Our findings show that while metrics such as UniEval and FactCC exhibit some sensitivity to noise and partial alignment with expert reasoning, none of them reliably catches hallucinations or consistently reflects true faithfulness. Manual analysis further reveals that metric behavior varies with the type and structure of the claim. Overall, current metrics are only moderately effective and are often biased toward the style of LLM-generated text. This study highlights the need for more reliable, context-aware evaluation methods and offers practical insights for improving how faithfulness is measured in fact-checking tasks.
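The abstract describes two analyses: pairwise agreement between the four metrics and their sensitivity to controlled errors. The sketch below is a minimal illustration of that setup, not the thesis code; the `score_with_metric` function is a hypothetical placeholder into which actual G-Eval, UniEval, FactCC, or QAGS implementations would be plugged.

```python
# Sketch of the two analyses described in the abstract:
# (1) pairwise agreement between faithfulness metrics, and
# (2) average score drop after injecting a controlled error.
# All scorers are assumed/hypothetical placeholders.
from itertools import combinations
from scipy.stats import spearmanr

METRICS = ["G-Eval", "UniEval", "FactCC", "QAGS"]

def score_with_metric(metric_name: str, claim: str, explanation: str) -> float:
    """Placeholder: return a faithfulness score in [0, 1] for one explanation."""
    raise NotImplementedError("plug in the actual metric implementation here")

def pairwise_agreement(claims, explanations):
    """Spearman correlation between each pair of metrics over the same data."""
    scores = {
        m: [score_with_metric(m, c, e) for c, e in zip(claims, explanations)]
        for m in METRICS
    }
    return {
        (a, b): spearmanr(scores[a], scores[b])[0]
        for a, b in combinations(METRICS, 2)
    }

def perturbation_sensitivity(claims, explanations, corrupt):
    """Mean score drop after `corrupt` injects a known error into each explanation,
    e.g. negating a statement or swapping a named entity."""
    drops = {}
    for m in METRICS:
        clean = [score_with_metric(m, c, e) for c, e in zip(claims, explanations)]
        noisy = [score_with_metric(m, c, corrupt(e)) for c, e in zip(claims, explanations)]
        drops[m] = sum(cl - no for cl, no in zip(clean, noisy)) / len(clean)
    return drops
```

A metric that is sensitive to faithfulness errors should show a clearly positive score drop under perturbation; near-zero drops suggest the metric is not registering the injected error.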
