Large Language Models for Reviewing Research Papers

Evaluating Claim-Level Completeness in Machine Learning Research

Bachelor Thesis (2026)

Author(s)

S.I. Simeonova (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use

https://resolver.tudelft.nl/uuid:e38a02ff-efca-4129-ae9e-cb9384e39bea

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Peer review is a central part of the scientific process, but the growing number of submissions has sparked interest in automated review tools. Recent work has shown that Large Language Models (LLMs) can generate reviews and evaluate author-provided checklists, yet it remains unclear to what extent they can independently identify the scientific claims made in a paper and perform a structured review. This thesis investigates whether an LLM can automatically extract scientific claims from machine learning research papers and then complete the NeurIPS Checklist without relying on author-written justifications. The evaluation focuses on three aspects: the accuracy of claim extraction, the preservation of each claim's semantic meaning, and the agreement between LLM-generated checklist annotations and human judgment. Claim extractions and checklist annotations produced by Gemini 3 Flash are compared against human ground-truth annotations on NeurIPS 2024 papers. The results show that the model identifies the primary claims of papers with a recall of 0.99 and a precision of 0.75. Most errors are caused by over-segmentation or incorrect classification. For checklist annotation, the system achieves a mean accuracy of 0.85 and a mean Cohen's kappa of 0.58 relative to human annotations. Agreement is strongest for objective checklist criteria. These findings indicate that LLMs can effectively support claim-based scientific review but are not yet advanced enough to fully replace expert reviewers.
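The metrics named in the abstract (precision, recall, and Cohen's kappa) can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names are hypothetical, the counts and labels are invented toy values rather than the thesis data, and the Yes/No/NA answer set is an assumption about how checklist items are labelled.

```python
from collections import Counter


def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall from raw counts of matched and unmatched claims."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance overlap implied by each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example (invented values): 3 extracted claims match the ground truth,
# 1 extracted claim is spurious, and no ground-truth claim is missed.
toy_precision, toy_recall = precision_recall(3, 1, 0)

# Toy checklist answers for eight items (assumed Yes/No/NA answer set).
human = ["Yes", "Yes", "No", "NA", "Yes", "No", "Yes", "Yes"]
model = ["Yes", "Yes", "No", "Yes", "Yes", "No", "No", "Yes"]
toy_kappa = cohens_kappa(human, model)

print(f"precision={toy_precision:.2f} recall={toy_recall:.2f} kappa={toy_kappa:.3f}")
```

Kappa is lower than raw accuracy whenever some agreement would be expected by chance, which is why an accuracy of 0.85 and a kappa of 0.58 can describe the same set of annotations.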

Files

RP_SimonaSimeonova_FINAL.pdf

(pdf | 0.79 MB)

License info not available