Large Language Models for Reviewing Research Papers

Evaluating Claim-Level Completeness in Machine Learning Research

Bachelor Thesis (2026)
Author(s)

S.I. Simeonova (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
23-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Peer review is an important part of the scientific process, but the growing number of submissions has sparked interest in automated review tools. Recent work has shown that Large Language Models (LLMs) can generate reviews and evaluate author-provided checklists, yet it remains unclear to what extent they can independently identify the scientific claims made in a paper and perform a structured review. This thesis investigates whether an LLM can automatically extract scientific claims from machine learning research papers and then complete the NeurIPS Checklist without relying on author-written justifications. The evaluation focuses on the accuracy of claim extraction, the preservation of each claim's semantic meaning, and the agreement between LLM-generated checklist annotations and human judgment. Claim extractions and checklist annotations produced by Gemini 3 Flash are compared against human ground-truth annotations on NeurIPS 2024 papers. The results show that the model identifies the primary claims of papers with a recall of 0.99 and a precision of 0.75; most errors are caused by over-segmentation or incorrect classification. For checklist annotation, the system achieves a mean accuracy of 0.85 and a mean Cohen's Kappa of 0.58 relative to human annotations, with the strongest agreement on objective checklist criteria. These findings indicate that LLMs can effectively support claim-based scientific review but are not advanced enough to fully replace expert reviewers.
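
The metrics named in the abstract (precision, recall, and Cohen's Kappa) can be illustrated with a minimal sketch. It is not the thesis's evaluation code. The function names, the claim identifiers, and the toy labels are hypothetical, and the exact-match comparison of claim identifiers is an assumption made for brevity; the thesis may match extracted claims to reference claims differently, for example by semantic similarity. The "Yes"/"No"/"NA" answers are assumed here as the checklist label set.

```python
from collections import Counter


def precision_recall(extracted, reference):
    """Precision and recall of extracted claims against a reference set.

    Assumption: claims are compared by exact identifier match.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap given each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical, not taken from the thesis).
# One spurious extra claim lowers precision but leaves recall untouched,
# which is the pattern over-segmentation tends to produce.
model_claims = ["c1", "c2", "c3", "c4"]
human_claims = ["c1", "c2", "c3"]
precision, recall = precision_recall(model_claims, human_claims)

model_answers = ["Yes", "Yes", "No", "NA", "Yes", "No"]
human_answers = ["Yes", "Yes", "No", "Yes", "Yes", "Yes"]
kappa = cohen_kappa(model_answers, human_answers)
```

In this toy case the two annotators agree on four of six checklist items, yet Kappa is noticeably lower than the raw agreement rate because "Yes" is frequent for both. The same effect explains how a mean accuracy of 0.85 can coexist with a mean Kappa of 0.58.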
