Evaluating the Ability of Large Language Models to Classify Scientific Papers as Empirical or Theoretical using the NeurIPS Checklist
A. Wielinga (TU Delft - Electrical Engineering, Mathematics and Computer Science)
D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
As machine learning conferences such as NeurIPS expand rapidly, the manual
classification and evaluation of responsible research checklists impose a significant burden on
reviewers. This study investigates the ability of Large Language Models (LLMs) to
automatically classify research papers as empirical, theoretical, or hybrid, and to extract
checklist compliance data. Using a dataset of publicly available NeurIPS papers, we
designed an automated pipeline and evaluated its outputs against a human-annotated
ground truth. Our results demonstrate that the LLM achieves high accuracy in the
core classification task, reliably distinguishing each paper’s core methodology by
identifying clear structural indicators such as mathematical proofs and benchmark datasets.
Furthermore, the model excels at extracting objective checklist elements, performing
well on closed-ended extraction tasks that rely on these same structural indicators. However,
performance noticeably decreased on structurally scattered or subjective criteria, such
as broader impacts and the declaration of AI usage. This drop highlights a limitation in
the model’s broader reading comprehension, as it struggles to integrate contextual
information without explicit headers. Notably, this automated failure closely mirrors human
task ambiguity, as these same subjective items also produced lower inter-annotator
agreement among human annotators. In conclusion, while LLMs provide a highly
consistent baseline for classifying paper typologies and extracting explicit methodological
data, their reliance on structural cues indicates they should serve as assistive screening
tools rather than autonomous evaluators in academic peer review.