Large Language Models for Reviewing Research Papers
Evaluating Claim-Level Completeness in Machine Learning Research
S.I. Simeonova (TU Delft - Electrical Engineering, Mathematics and Computer Science)
D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Peer review is a central part of the scientific process, but the growing number of submissions has sparked interest in automated review tools. Recent work has shown that Large Language Models (LLMs) can generate reviews and evaluate author-provided checklists, yet it remains unclear to what extent they can independently identify the scientific claims made in a paper and perform a structured review. This thesis investigates whether an LLM can automatically extract scientific claims from machine learning research papers and then complete the NeurIPS Checklist without relying on author-written justifications. The evaluation focuses on three aspects: the accuracy of claim extraction, the preservation of each claim's semantic meaning, and the agreement between LLM-generated checklist annotations and human judgment. Claim extractions and checklist annotations produced by Gemini 3 Flash are compared against human ground-truth annotations on NeurIPS 2024 papers. The results show that the model identifies the primary claims of papers with a recall of 0.99 and a precision of 0.75. Most errors are caused by over-segmentation or incorrect classification. For checklist annotation, the system achieves a mean accuracy of 0.85 and a mean Cohen's Kappa of 0.58 relative to human annotations. Agreement is strongest for objective checklist criteria. These findings indicate that LLMs can effectively support claim-based scientific review, but are not advanced enough to fully replace expert reviewers.
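For readers unfamiliar with the reported metrics, the following is a minimal sketch of how precision, recall, and Cohen's Kappa are conventionally computed. It is not the thesis's evaluation code: the function names, the counts, and the toy "Yes"/"No" labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Sequence


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical counts: 3 correctly extracted claims, 1 spurious, 0 missed.
    print(precision_recall(tp=3, fp=1, fn=0))  # (0.75, 1.0)

    # Hypothetical checklist answers for four items (not data from the thesis).
    human = ["Yes", "No", "Yes", "No"]
    model = ["Yes", "No", "No", "No"]
    print(cohen_kappa(human, model))  # 0.5
```

In the toy example the two annotators agree on 3 of 4 items (accuracy 0.75), yet Kappa is only 0.5 because half of that agreement is expected by chance. This is why a mean accuracy of 0.85 and a mean Kappa of 0.58 can describe the same set of annotations.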