Explainable Fact-Checking with Large Language Models

How Prompt Style Variation Affects Accuracy and Faithfulness in Claim Justifications

Bachelor Thesis (2025)
Author(s)

M. Serafeimidi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Pradeep Murukannaiah – Mentor (TU Delft - Interactive Intelligence)

S. Mukherjee – Mentor (TU Delft - Interactive Intelligence)

X. Zhang – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
27-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) such as GPT-4 and LLaMA have demonstrated promising performance in fact-checking tasks, particularly in labeling the veracity of claims. However, the real-world utility of such fact-checking systems depends not only on label accuracy but also on the faithfulness of the justifications they provide. Prior work has explored various prompting strategies to elicit reasoning from LLMs, but most studies evaluate these styles in isolation or focus solely on veracity classification, neglecting their impact on explanation quality. This study addresses that gap by investigating how different prompt styles affect both the accuracy of LLM-generated claim labels and the faithfulness of the accompanying justifications. Seven established prompting strategies, such as Chain-of-Thought, Role-Based, and Decompose-and-Verify, were tested across two datasets (QuanTemp and HoVer) using two efficient models: LLaMA 3.1:8B and GPT-4o-mini. Additionally, two novel prompt variants were introduced, and all styles were tested under three label conditions to assess bias and explanation drift.
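The prompt-style comparison described above can be illustrated with a minimal sketch. The template wording, style names, and the `build_prompt` helper below are assumptions for illustration only, not the prompts actually used in the thesis:

```python
# Illustrative sketch of prompt-style variation for claim verification.
# Three of the styles named in the abstract; the exact wording is hypothetical.
PROMPT_STYLES = {
    "direct": (
        "Claim: {claim}\n"
        "Label the claim as True or False and justify your answer."
    ),
    "chain_of_thought": (
        "Claim: {claim}\n"
        "Think step by step, then label the claim as True or False "
        "and justify your answer."
    ),
    "role_based": (
        "You are a professional fact-checker.\n"
        "Claim: {claim}\n"
        "Label the claim as True or False and justify your answer."
    ),
}

def build_prompt(style: str, claim: str) -> str:
    """Fill the chosen style's template with the claim under test."""
    return PROMPT_STYLES[style].format(claim=claim)

if __name__ == "__main__":
    # The same claim is sent under every style so that differences in
    # label accuracy and justification faithfulness can be attributed
    # to the prompt style alone.
    for style in PROMPT_STYLES:
        print(f"--- {style} ---")
        print(build_prompt(style, "The Eiffel Tower is located in Berlin."))
```

Holding the claim fixed while varying only the template is what lets a study of this kind isolate the effect of prompt style from the effect of claim difficulty.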
