Exploring Human-AI Synergy for Complex Claim Verification

Journal Article (2025)
Author(s)

S. Mukherjee (TU Delft - Interactive Intelligence)

C.M. Jonker (TU Delft - Interactive Intelligence)

P.K. Murukannaiah (TU Delft - Interactive Intelligence)

Research Group
Interactive Intelligence
DOI
https://doi.org/10.3233/FAIA250620
Publication Year
2025
Language
English
Volume number
408
Pages (from-to)
2-15
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Combating widespread misinformation requires scalable and reliable fact-checking methods. Fact-checking involves several steps, including question generation, evidence retrieval, and veracity prediction. Importantly, fact-checking is well suited to exploit hybrid intelligence, since it requires both human expertise and AI's large-scale information-processing abilities. Constructing an effective fact-checking pipeline therefore requires a systematic understanding of the relative strengths and weaknesses of humans and AI at each step of the process. We investigate the ability of LLMs to perform the first step, i.e., generating pertinent questions for analyzing a claim. To evaluate the quality of the LLM-generated questions, we crowdsource a dataset in which 150 claims are annotated with the questions that (1) a novice fact-checker and (2) a professional fact-checker would ask when fact-checking those claims. We then study the effects of the human- and LLM-generated questions on evidence retrieval and veracity prediction. We find that LLMs can generate nuanced questions for verifying a complex claim, but that final label prediction depends on the quality of the evidence corpus: evidence collected by automated methods yields lower accuracy on the veracity prediction task than evidence curated by experts.
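
To make the pipeline concrete, below is a minimal, self-contained Python sketch of the three steps the abstract names: question generation, evidence retrieval, and veracity prediction. Every function body here is a hypothetical placeholder, not the authors' implementation; a real system would prompt an LLM for steps 1 and 3 and use web search or a dense retriever for step 2.

```python
# Illustrative sketch of a three-step fact-checking pipeline.
# All function bodies are hypothetical stand-ins for the components
# studied in the paper (LLM question generation, evidence retrieval,
# veracity prediction), not the authors' actual system.

from dataclasses import dataclass


@dataclass
class Verdict:
    label: str           # e.g., "supported", "refuted", "not enough info"
    evidence: list[str]  # the evidence passages the label is based on


def generate_questions(claim: str) -> list[str]:
    """Step 1: decompose a claim into verification questions.
    In the paper this is done by an LLM (or by novice/professional
    fact-checkers); here we return fixed templates as a placeholder."""
    return [
        f"What is the original source of the claim: '{claim}'?",
        f"What evidence supports or contradicts: '{claim}'?",
    ]


def retrieve_evidence(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 2: rank corpus passages by naive word overlap with the question.
    A stand-in for web search or dense retrieval; per the abstract, the
    quality of this evidence corpus drives final prediction accuracy."""
    q_words = set(question.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(q_words & set(p.lower().split())))
    return ranked[:k]


def predict_veracity(claim: str, evidence: list[str]) -> Verdict:
    """Step 3: map claim + evidence to a veracity label. Placeholder
    heuristic; in practice an LLM or trained classifier would decide."""
    label = "not enough info" if not evidence else "needs review"
    return Verdict(label=label, evidence=evidence)


def fact_check(claim: str, corpus: list[str]) -> Verdict:
    """Run the full pipeline: questions -> evidence -> label."""
    evidence: list[str] = []
    for question in generate_questions(claim):
        evidence.extend(retrieve_evidence(question, corpus))
    return predict_veracity(claim, evidence)


if __name__ == "__main__":
    corpus = ["A passage about the claim's origin.", "An unrelated passage."]
    print(fact_check("Example complex claim.", corpus))
```

This sketch is only meant to show where human or LLM contributions plug in: swapping the placeholder in step 1 between novice-, professional-, and LLM-generated questions, while holding steps 2 and 3 fixed, mirrors the comparison the study performs.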