Iterative Prompt Refinement via Knowledge Alignment: A Case Study in Systematic Review Screening
A.S. Kuiper (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. Yang – Mentor (TU Delft - Web Information Systems)
Christoph Lofi – Graduation committee member (TU Delft - Web Information Systems)
P.K. Murukannaiah – Graduation committee member (TU Delft - Interactive Intelligence)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Applying Large Language Models (LLMs) to high-stakes classification tasks such as systematic review screening is hampered by prompt sensitivity and a lack of transparency. We introduce IMAPR (Iterative Multi-signal Adaptive Prompt Refinement), a novel framework in which a single LLM uses its own internal signals to iteratively refine its prompts, improving classification robustness and reliability. Unlike black-box optimizers that tune prompts using only external scores, IMAPR is a white-box approach that diagnoses why a prediction failed using three internal signals: model confidence, a rationale, and a knowledge alignment score that checks whether the evidence cited in the rationale actually covers the user-defined inclusion criteria. We evaluate IMAPR on a real-world biomedical screening task, comparing it against strong baselines including GPO and StraGo. IMAPR outperforms the best baseline (GPO) by 8.8% in Macro-F1 while maintaining high, stable recall across runs. Across seven LLMs, IMAPR yields an average 9.2% improvement in Macro-F1. An ablation shows that knowledge alignment acts as a recall safeguard: removing it leaves Macro-F1 similar but degrades recall, reducing reliability for screening. These results suggest that diagnostic, signal-driven prompt refinement is a practical alternative to black-box optimization for transparent, dependable LLM screening systems.
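To make the knowledge alignment idea concrete, the sketch below scores how many inclusion criteria are covered by the evidence cited in a rationale, then combines that score with model confidence to decide whether a prediction should trigger prompt refinement. It is an illustration only, not the thesis's implementation: the keyword matching, the function names `alignment_score` and `needs_refinement`, and the thresholds `conf_min` and `align_min` are all assumptions, since the abstract does not specify how the score is computed or used.

```python
from typing import Dict, List


def alignment_score(evidence: List[str], criteria: Dict[str, List[str]]) -> float:
    """Fraction of inclusion criteria covered by the cited evidence.

    Hypothetical stand-in: a criterion counts as covered if any of its
    keywords appears in the evidence text. The actual method may use a
    different (for example, LLM-based or semantic) coverage check.
    """
    if not criteria:
        return 0.0
    text = " ".join(evidence).lower()
    covered = sum(
        any(keyword.lower() in text for keyword in keywords)
        for keywords in criteria.values()
    )
    return covered / len(criteria)


def needs_refinement(
    confidence: float,
    alignment: float,
    conf_min: float = 0.7,   # assumed threshold, not from the source
    align_min: float = 0.5,  # assumed threshold, not from the source
) -> bool:
    """Flag a prediction for prompt refinement when either signal is weak."""
    return confidence < conf_min or alignment < align_min


# Toy inclusion criteria and cited evidence, invented for illustration.
criteria = {
    "population": ["adult"],
    "intervention": ["statin"],
    "outcome": ["mortality"],
}
evidence = ["Adults received statin therapy."]

score = alignment_score(evidence, criteria)
flag = needs_refinement(confidence=0.9, alignment=score)
print(f"alignment={score:.2f}, refine={flag}")
```

In this toy case the evidence covers the population and intervention criteria but not the outcome criterion, so two of three criteria are covered. A confident prediction with low coverage would still be flagged, which mirrors the abstract's description of knowledge alignment as a safeguard that is independent of confidence.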