Iterative Prompt Refinement via Knowledge Alignment: A Case Study in Systematic Review Screening

Master Thesis (2025)
Author(s)

A.S. Kuiper (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Web Information Systems)

Christoph Lofi – Graduation committee member (TU Delft - Web Information Systems)

P.K. Murukannaiah – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
25-08-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Applying Large Language Models (LLMs) to high-stakes classification tasks such as systematic review screening is challenged by prompt sensitivity and a lack of transparency. We introduce IMAPR (Iterative Multi-signal Adaptive Prompt Refinement), a novel framework in which a single LLM uses its own internal signals to iteratively refine its prompts, improving classification robustness and reliability. Unlike black-box optimizers that tune prompts using only external scores, IMAPR is a white-box approach that diagnoses why a prediction failed using three internal signals: model confidence, a rationale, and a knowledge alignment score that checks whether the evidence cited in the rationale actually covers the user-defined inclusion criteria. We evaluate IMAPR on a real-world biomedical screening task, comparing it against strong baselines including GPO and StraGo. IMAPR outperforms the best baseline (GPO) by 8.8% in Macro-F1 while maintaining high, stable recall across runs. Across seven LLMs, IMAPR yields an average 9.2% improvement in Macro-F1. An ablation shows that the knowledge-alignment signal acts as a recall safeguard: removing it leaves Macro-F1 similar but degrades recall, reducing reliability for screening. These results suggest that diagnostic, signal-driven prompt refinement is a practical alternative to black-box optimization for transparent, dependable LLM screening systems.
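The refinement loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the alignment heuristic (substring coverage of criterion terms), the thresholds, and all function names (`knowledge_alignment`, `refine_prompt`, `screen`, the `llm` callable) are assumptions introduced here for illustration.

```python
def knowledge_alignment(rationale: str, criteria: list[str]) -> float:
    """Toy alignment score: fraction of inclusion criteria whose key
    term appears in the model's rationale (the thesis may score this
    differently)."""
    covered = sum(1 for c in criteria if c.lower() in rationale.lower())
    return covered / len(criteria) if criteria else 0.0


def refine_prompt(prompt: str, missing: list[str]) -> str:
    """Append diagnostic feedback so the next iteration explicitly
    addresses the criteria the rationale failed to cover."""
    return prompt + "\nExplicitly assess these criteria: " + "; ".join(missing)


def screen(llm, abstract: str, criteria: list[str], prompt: str,
           max_iters: int = 3, conf_threshold: float = 0.8,
           align_threshold: float = 1.0):
    """Iterate: classify, inspect internal signals, refine the prompt
    when confidence or knowledge alignment is too low."""
    for _ in range(max_iters):
        # llm returns (label, confidence, rationale) for one abstract
        label, confidence, rationale = llm(prompt, abstract)
        alignment = knowledge_alignment(rationale, criteria)
        if confidence >= conf_threshold and alignment >= align_threshold:
            return label, prompt  # signals agree: accept the prediction
        missing = [c for c in criteria
                   if c.lower() not in rationale.lower()]
        prompt = refine_prompt(prompt, missing)
    return label, prompt  # best effort after the iteration budget
```

A stub `llm` that only covers a criterion once the prompt names it shows the loop converging in one refinement step; in practice the callable would wrap a real model API and the confidence would come from token probabilities or a self-reported estimate.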

Files

Thesis.pdf
(pdf | 0.488 MB)
License info not available