AW

A. Wielinga

info

Please Note

1 records found

As machine learning conferences such as NeurIPS expand rapidly, the manual classi-
fication and evaluation of responsible research checklists impose a significant burden on
reviewers. This study investigates the ability of Large Language Models (LLMs) to au-
tomatically classify research papers as empirical, theoretical, or hybrid, and to extract
checklist compliance data. Using a dataset of publicly available NeurIPS papers, we
designed an automated pipeline and evaluated its outputs against a human-annotated
ground truth. Our results demonstrate that the LLM achieves high accuracy in the
core classification task, reliably distinguishing the papers core methodology by iden-
tifying clear structural indicators like mathematical proofs and benchmark datasets.
Furthermore, the model excels at extracting objective checklist elements, performing
well on close-ended extraction tasks that rely on clear structural indicators. However,
performance noticeably decreased on structurally scattered or subjective criteria, such
as broader impacts and the declaration of AI usage. This drop highlights a limitation in
the model’s broader reading comprehension, as it struggles to merge contextual infor-
mation without explicit headers. Notably, this automated failure closely mirrors human
task ambiguity, as these exact subjective items also generated the lower inter-annotator
agreement among human annotators. Conclusively, while LLMs provide a highly con-
sistent baseline for classifying paper typologies and extracting explicit methodological
data, their reliance on structural cues indicates they should serve as assistive screening
tools rather than autonomous evaluators in academic peer review. ...