Analysis of Results in the ML Research Field

How well can an LLM decide the reproducibility of a paper?

Bachelor Thesis (2026)

Author(s)

A.A. Opritoiu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Large language models, Reproducibility, Peer review automation

To reference this document use

https://resolver.tudelft.nl/uuid:eb83c1ce-e468-440a-9b08-b3c9af168c5e

More Info

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The recent surge in machine learning (ML) research has led to a record number of paper submissions, overwhelming the traditional peer-review process. Although conferences such as NeurIPS have introduced reproducibility checklists to maintain scientific standards, manually verifying the claims authors make in these checklists is time-consuming and inconsistent. This study investigates the feasibility of using Large Language Models (LLMs) to automate the evaluation of paper reproducibility. Using a ground-truth dataset created by manually annotating NeurIPS papers, it assesses how accurately LLMs verify author claims regarding code availability, hyperparameter transparency, and compute resources. LLM outputs are compared with the manual labels to identify where automated tools succeed and where they fail to capture technical nuances. The results show that while LLMs can act as highly efficient administrative filters that streamline initial screening, they do not reliably predict execution viability, which marks the current boundary of automated verification.
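The comparison between LLM outputs and manual labels described in the abstract can be illustrated with a minimal sketch. It assumes each paper carries one boolean label per checklist item and that agreement is measured as plain per-item accuracy; the item names, the `per_item_accuracy` helper, and the sample labels are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

# Hypothetical item names, chosen to mirror the three claims listed in the abstract.
ITEMS = ("code_availability", "hyperparameters", "compute_resources")


def per_item_accuracy(manual, predicted):
    """Return, per item, the fraction of papers where the LLM label equals the manual label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for paper_id, truth in manual.items():
        guess = predicted.get(paper_id, {})
        for item in ITEMS:
            if item not in truth:
                continue  # skip items the annotator did not label
            totals[item] += 1
            hits[item] += int(guess.get(item) == truth[item])
    # Only report items that have at least one manual label.
    return {item: hits[item] / totals[item] for item in ITEMS if totals[item]}


# Made-up labels for illustration only; these are not results from the thesis.
manual = {
    "paper_a": {"code_availability": True, "hyperparameters": True, "compute_resources": False},
    "paper_b": {"code_availability": False, "hyperparameters": True, "compute_resources": True},
}
predicted = {
    "paper_a": {"code_availability": True, "hyperparameters": False, "compute_resources": False},
    "paper_b": {"code_availability": False, "hyperparameters": True, "compute_resources": False},
}

accuracy = per_item_accuracy(manual, predicted)
```

Reporting agreement per item rather than as one pooled score keeps it visible which kinds of claims an automated check handles well and which it does not. A missing LLM label is counted as a disagreement here, which is one possible convention among several.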

Files

Research_paper-16.pdf

(pdf | 0.58 MB)

License info not available