Analysis of results in the ML research field

Investigating the Efficacy of LLMs in Extracting Stated Research Limitations

Bachelor Thesis (2026)

Author(s)

A.E. Predoi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
23-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The rapid growth of Machine Learning research has overwhelmed traditional peer-review systems, raising concerns about reviewer fatigue and the consistency of scientific evaluation. While Large Language Models (LLMs) are being explored as potential assistants for quality assessment, their ability to objectively verify specific scientific criteria, such as those in the NeurIPS Paper Checklist, remains unproven. This checklist serves as a structured self-auditing framework that requires authors to explicitly declare critical details, including potential negative societal impacts, exact hyperparameter tuning ranges, and clear definitions of theoretical assumptions or limitations. This study investigates the core question: “How well can an LLM extract the limitations described in scientific papers?” Using a manually annotated dataset of 78 papers, we evaluate how accurately LLMs extract the limitations stated by authors. Our findings reveal that while the LLM achieves perfect accuracy (100%) in detecting the presence of dedicated limitation sections, its performance in textual extraction is more nuanced. For explicit limitations, the model demonstrates high recall (0.91) but moderate precision (0.71), frequently over-extracting surrounding context. When tasked with extracting implicit limitations from papers lacking dedicated sections, both recall (0.71) and precision (0.69) decline. Notably, a major bottleneck in unstructured text is keeping the LLM focused on the stated weakness itself rather than on the subsequent sentences that discuss future work. By comparing LLM performance against a human-verified ground truth, this work provides a feasibility study for automating high-stakes research quality assessments and identifies current bottlenecks in LLM reasoning for scientific auditing.
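
As a rough illustration of how the precision and recall figures in the abstract can be read, the sketch below scores a set of extracted limitation statements against a human-annotated reference set. It is not the thesis's evaluation code: the abstract does not state the matching criterion, so exact matching on normalized text is an assumption, and the function names and sample sentences are hypothetical.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def precision_recall(extracted: list[str], gold: list[str]) -> tuple[float, float]:
    """Score extracted statements against gold annotations.

    Assumption: a statement counts as correct only if it matches a gold
    statement exactly after normalization. A real evaluation may use a
    looser, human-judged notion of a match.
    """
    extracted_set = {normalize(s) for s in extracted}
    gold_set = {normalize(s) for s in gold}
    true_positives = len(extracted_set & gold_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical example: the model finds both annotated limitations but also
# pulls in a future-work sentence, which lowers precision and not recall.
gold = [
    "The dataset covers only English papers.",
    "Results rely on a single model.",
]
extracted = [
    "The dataset covers only  English papers.",
    "Results rely on a single model.",
    "Future work will explore larger datasets.",
]
precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every annotated limitation is recovered, so recall is perfect, while the extra future-work sentence pulls precision down. This is the same pattern the abstract describes as over-extraction.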
