Explainable Fact-Checking with LLMs

How do different LLMs compare in their rationales?

Bachelor Thesis (2025)
Author(s)

M. Bordea (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

X. Zhang – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Pradeep Murukannaiah – Mentor (TU Delft - Interactive Intelligence)

S. Mukherjee – Mentor (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
26-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) are becoming increasingly commonplace. However, their adoption, especially in fact-checking, is slowed by distrust in their reasoning and in the rationales behind their results. In critical applications, the justification for a verdict matters at least as much as the verdict itself, yet LLMs often produce explanations that are not grounded in the provided evidence, leading to hallucinations and reduced trust in their outputs. This paper examines how current LLMs perform on two fronts: the faithfulness of their explanations to the provided evidence and the correctness of their verdicts. To investigate this, multiple LLMs are asked to assign a label to a claim based on evidence drawn from two datasets of varying complexity, HoVer and QuanTemp. The outputs are then evaluated both manually and by another LLM to assess how closely each explanation is tied to the evidence and whether parts of the response are hallucinated. The results reveal that while some models achieve high correctness in label assignment, the faithfulness of their explanations varies significantly across models and evidence types. These findings aim to inform both LLM developers and fact-checking researchers about the current limitations of LLM response quality and to highlight the areas that require further improvement before such systems can be adopted more widely.
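The two-stage setup described in the abstract (an LLM assigns a verdict with an explanation, and a second LLM judges the explanation's faithfulness) can be sketched roughly as below. This is a minimal illustration, not the thesis implementation: the prompt wording, the label set, and the `fact_checker`/`judge` callables are assumptions introduced here for demonstration.

```python
# Illustrative sketch of the claim-verification and LLM-as-judge pipeline.
# Prompt wording, labels, and the model callables are assumptions for
# demonstration; they are not the exact setup used in the thesis.
from typing import Callable, List, Tuple

LABELS = ["SUPPORTED", "REFUTED", "NOT ENOUGH INFO"]  # assumed label set


def build_verdict_prompt(claim: str, evidence: List[str]) -> str:
    """Ask a model for a label plus an explanation grounded in the evidence."""
    evidence_block = "\n".join(f"- {e}" for e in evidence)
    return (
        "You are a fact-checking assistant.\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence_block}\n"
        f"Assign one label from {LABELS} and justify it using ONLY the evidence above."
    )


def build_judge_prompt(claim: str, evidence: List[str], explanation: str) -> str:
    """Ask a second model whether the explanation is faithful to the evidence."""
    evidence_block = "\n".join(f"- {e}" for e in evidence)
    return (
        "Does the explanation below rely only on the given evidence, or does it "
        "introduce unsupported (hallucinated) statements?\n"
        f"Claim: {claim}\n"
        f"Evidence:\n{evidence_block}\n"
        f"Explanation: {explanation}\n"
        "Answer FAITHFUL or UNFAITHFUL with a brief reason."
    )


def evaluate_claim(
    claim: str,
    evidence: List[str],
    fact_checker: Callable[[str], str],  # e.g. a wrapper around any chat LLM
    judge: Callable[[str], str],         # a second LLM acting as evaluator
) -> Tuple[str, str]:
    """Run one claim through the verdict stage and the faithfulness-judging stage."""
    verdict_and_explanation = fact_checker(build_verdict_prompt(claim, evidence))
    faithfulness_assessment = judge(
        build_judge_prompt(claim, evidence, verdict_and_explanation)
    )
    return verdict_and_explanation, faithfulness_assessment
```

In practice, `fact_checker` and `judge` would wrap calls to the LLMs under comparison, and the judge's output would be aggregated across claims alongside the manual annotations to produce per-model faithfulness statistics.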
