Analysis of Results in the ML Research Field
How well can an LLM decide the reproducibility of a paper?
A.A. Opritoiu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
The recent surge in machine learning (ML) research has led to a record number of paper submissions, overwhelming the traditional peer-review process. Although conferences such as NeurIPS have introduced reproducibility checklists to maintain scientific standards, manually verifying the claims authors make on these checklists is time-consuming and inconsistent. This study investigates the feasibility of using Large Language Models (LLMs) to automate the evaluation of paper reproducibility. Using a ground-truth dataset built by manually annotating NeurIPS papers, it assesses how accurately LLMs verify author claims regarding code availability, hyperparameter transparency, and compute resources. LLM outputs are compared with the manual labels to identify where automated tools succeed and where they fail to capture technical nuances. The results show that LLMs can act as highly efficient administrative filters that streamline initial screening, but they fail to reliably predict execution viability, which marks the current boundary of automated verification.
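As an illustration of the comparison between LLM outputs and manual labels described in the abstract, the sketch below scores one checklist item by accuracy and by Cohen's kappa. It is a minimal sketch, not the thesis's evaluation code: the abstract mentions accuracy only, so Cohen's kappa is an added assumption, and the function names, the "yes"/"no" label format, and the example labels are hypothetical.

```python
from collections import Counter


def accuracy(manual, llm):
    """Fraction of papers where the LLM label matches the manual label."""
    if len(manual) != len(llm) or not manual:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(m == p for m, p in zip(manual, llm)) / len(manual)


def cohen_kappa(manual, llm):
    """Agreement between two label lists, corrected for chance agreement."""
    n = len(manual)
    observed = accuracy(manual, llm)
    manual_counts = Counter(manual)
    llm_counts = Counter(llm)
    # Chance agreement: probability both sources pick the same class at random.
    classes = set(manual) | set(llm)
    expected = sum(manual_counts[c] * llm_counts[c] for c in classes) / (n * n)
    if expected == 1.0:
        # Both sources used a single identical class; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels for one checklist item (e.g. "code is available"),
    # one entry per paper. These values are made up for illustration only.
    manual_labels = ["yes", "yes", "no", "no"]
    llm_labels = ["yes", "no", "no", "no"]
    print("accuracy:", accuracy(manual_labels, llm_labels))
    print("kappa:", cohen_kappa(manual_labels, llm_labels))
```

Reporting a chance-corrected measure next to raw accuracy can help when one answer dominates a checklist item, since an LLM that always returns the majority label would still score high on accuracy.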