Explainable Fact-Checking with Large Language Models

How Prompt Style Variation Affects Accuracy and Faithfulness in Claim Justifications

Bachelor Thesis (2025)
Author(s)

M. Serafeimidi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Pradeep Murukannaiah – Mentor (TU Delft - Interactive Intelligence)

S. Mukherjee – Mentor (TU Delft - Interactive Intelligence)

X. Zhang – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated promising performance in fact-checking tasks, particularly in labeling the veracity of claims. However, the real-world utility of such fact-checking systems depends not only on label accuracy but also on the faithfulness of the justifications they provide. Prior work has explored various prompting strategies to elicit reasoning from LLMs, but most studies evaluate these styles in isolation or focus solely on veracity classification, neglecting their impact on explanation quality. This study addresses that gap by investigating how different prompt styles affect both the accuracy of LLM-generated claim labels and the faithfulness of the accompanying justifications. Seven established prompting strategies, such as Chain-of-Thought, Role-Based, and Decompose-and-Verify, were tested across two datasets (QuanTemp and HoVer) using two efficient models: LLaMA 3.1:8B and GPT-4o-mini. Additionally, two novel prompt variants were introduced, and all styles were tested under three label conditions to assess bias and explanation drift.
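The prompt-style comparison described above can be illustrated with a minimal sketch. The template wording, style names, and the `build_prompt` helper below are assumptions for illustration only, not the prompts actually used in the thesis:

```python
# Illustrative sketch of prompt-style variation for claim verification.
# Three of the styles named in the abstract; the exact wording is hypothetical.
PROMPT_STYLES = {
    "direct": (
        "Claim: {claim}\n"
        "Label the claim as True or False and justify your answer."
    ),
    "chain_of_thought": (
        "Claim: {claim}\n"
        "Think step by step, then label the claim as True or False "
        "and justify your answer."
    ),
    "role_based": (
        "You are a professional fact-checker.\n"
        "Claim: {claim}\n"
        "Label the claim as True or False and justify your answer."
    ),
}

def build_prompt(style: str, claim: str) -> str:
    """Fill the chosen style's template with the claim under test."""
    return PROMPT_STYLES[style].format(claim=claim)

if __name__ == "__main__":
    # The same claim is sent under every style so that differences in
    # label accuracy and justification faithfulness can be attributed
    # to the prompt style alone.
    for style in PROMPT_STYLES:
        print(f"--- {style} ---")
        print(build_prompt(style, "The Eiffel Tower is located in Berlin."))
```

Holding the claim fixed while varying only the template is what lets a study of this kind isolate the effect of prompt style from the effect of claim difficulty.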
