Evaluating the Ability of Large Language Models to Classify Scientific Papers as Empirical or Theoretical using the NeurIPS Checklist

Bachelor Thesis (2026)

Author(s)

A. Wielinga (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

LLM CSE Peer Review

To reference this document use

https://resolver.tudelft.nl/uuid:b808ee19-81a9-4fa2-8fce-1a3171a24346

More Info

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As machine learning conferences such as NeurIPS expand rapidly, the manual classification and evaluation of responsible research checklists impose a significant burden on reviewers. This study investigates the ability of Large Language Models (LLMs) to automatically classify research papers as empirical, theoretical, or hybrid, and to extract checklist compliance data. Using a dataset of publicly available NeurIPS papers, we designed an automated pipeline and evaluated its outputs against a human-annotated ground truth. Our results demonstrate that the LLM achieves high accuracy in the core classification task, reliably distinguishing each paper's core methodology by identifying clear structural indicators such as mathematical proofs and benchmark datasets. The model also excels at extracting objective checklist elements, performing well on closed-ended extraction tasks that rely on the same kind of explicit cues. However, performance decreased noticeably on structurally scattered or subjective criteria, such as broader impacts and the declaration of AI usage. This drop highlights a limitation in the model's broader reading comprehension: it struggles to merge contextual information that is not marked by explicit headers. Notably, this failure closely mirrors human task ambiguity, as these same subjective items also produced lower inter-annotator agreement. In conclusion, while LLMs provide a highly consistent baseline for classifying paper typologies and extracting explicit methodological data, their reliance on structural cues indicates that they should serve as assistive screening tools rather than autonomous evaluators in academic peer review.
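
The two comparisons mentioned in the abstract (model output against a human-annotated ground truth, and agreement between human annotators) can be illustrated with a minimal sketch. The abstract does not state which metrics the thesis uses, so plain accuracy and Cohen's kappa are assumptions here, and the label lists are hypothetical toy data, not results from the thesis.

```python
from collections import Counter


def accuracy(predicted, reference):
    """Fraction of items where the predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    if n != len(rater_b) or n == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    # Observed agreement: share of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labelled independently,
    # each following their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels for six papers (illustration only).
human_a = ["empirical", "empirical", "theoretical", "hybrid", "empirical", "theoretical"]
human_b = ["empirical", "hybrid", "theoretical", "hybrid", "empirical", "theoretical"]
llm_out = ["empirical", "empirical", "theoretical", "empirical", "empirical", "theoretical"]

llm_accuracy = accuracy(llm_out, human_a)
human_kappa = cohens_kappa(human_a, human_b)
```

Reporting a chance-corrected agreement score alongside accuracy is one way to check whether items on which the model scores poorly are also items on which the human annotators disagree.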

Files

FinalPaperAdamRP.pdf

(pdf | 1.54 Mb)

License info not available