Analysis of Results in the ML Research Field

How well can an LLM decide the reproducibility of a paper?

Bachelor Thesis (2026)

Author(s)

A.A. Opritoiu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
23-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Downloads counter
7
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The recent surge in machine learning (ML) research has produced a record number of paper submissions, overwhelming the traditional peer-review process. Although conferences such as NeurIPS have introduced reproducibility checklists to maintain scientific standards, manually verifying the claims authors make in them is time-consuming and inconsistent. This study investigates whether Large Language Models (LLMs) can automate the evaluation of paper reproducibility. Using a ground-truth dataset built by manually annotating NeurIPS papers, it assesses how accurately LLMs verify author claims about code availability, hyperparameter transparency, and compute resources. LLM outputs are compared with the manual labels to identify where automated tools succeed and where they miss technical nuances. The results show that LLMs can act as efficient administrative filters that streamline initial screening, but they do not reliably predict execution viability, which marks a current limit of automated verification.

Files

Research_paper-16.pdf
(pdf | 0.58 Mb)
License info not available