Analysis of results in the ML research field

Investigating the Efficacy of LLMs in Extracting Stated Research Limitations

Bachelor Thesis (2026)

Author(s)

A.E. Predoi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C. Hao – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

H.S. Hung – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Tömen – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

K.A. Hildebrandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Large language models, Information extraction, Scientific auditing

To reference this document use

https://resolver.tudelft.nl/uuid:bebea6a3-0921-440a-8a29-7aa7f55cac4e

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The rapid growth of Machine Learning research has overwhelmed traditional peer-review systems, raising concerns about reviewer fatigue and the consistency of scientific evaluation. While Large Language Models (LLMs) are being explored as potential assistants for quality assessment, their ability to objectively verify specific scientific criteria, such as those in the NeurIPS Paper Checklist, remains unproven. This checklist serves as a structured self-auditing framework that requires authors to explicitly declare critical details, including potential negative societal impacts, exact hyperparameter tuning ranges, and clear definitions of theoretical assumptions or limitations. This study investigates the core question: “How well can an LLM extract the limitations described in scientific papers?” Using a manually annotated dataset of 78 papers, it evaluates the accuracy of LLMs in extracting limitations stated by authors. Our findings reveal that while the LLM achieves perfect accuracy (100%) in detecting the presence of dedicated limitation sections, its performance in textual extraction is more nuanced. For explicit limitations, the model demonstrates high recall (0.91) but moderate precision (0.71), frequently over-extracting surrounding context. When tasked with extracting implicit limitations from papers lacking dedicated sections, both recall (0.71) and precision (0.69) decline. Notably, a major bottleneck in unstructured text is keeping the LLM focused on the stated weakness rather than on the subsequent sentences that discuss future work. By comparing LLM performance against a human-verified ground truth, this work provides a feasibility study for automating high-stakes research quality assessments and identifies current bottlenecks in LLM reasoning for scientific auditing.
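
The precision and recall figures above compare the limitations an LLM extracts against the human-annotated ones. The abstract does not describe how extracted text was matched to the ground truth, so the following is only a minimal sketch under an assumed rule: a sentence counts as correct when it matches an annotated sentence exactly after whitespace and case normalization. The function names and the example sentences are hypothetical and are not taken from the thesis.

```python
from typing import Iterable, Tuple


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different spans compare equal."""
    return " ".join(sentence.lower().split())


def precision_recall(extracted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Sentence-level precision and recall under exact matching (an assumed rule)."""
    extracted_set = {normalize(s) for s in extracted}
    gold_set = {normalize(s) for s in gold}
    true_positives = len(extracted_set & gold_set)
    # Precision: share of extracted sentences that are annotated limitations.
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    # Recall: share of annotated limitations that were extracted.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical annotations for one paper (not data from the thesis).
gold = [
    "Our method was evaluated on a single dataset.",
    "Training requires substantial GPU memory.",
]
# Hypothetical model output that also picks up a future-work sentence.
extracted = [
    "Our method was evaluated on a single dataset.",
    "Training requires substantial GPU memory.",
    "In future work, we plan to extend the approach to other domains.",
]

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=1.00
```

In this toy case the extra future-work sentence leaves recall at 1.00 but lowers precision to 0.67, the same pattern of over-extraction the abstract reports for explicit limitations. A real evaluation would likely need a softer matching rule (for example token overlap), since extracted spans rarely match annotations verbatim.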

Files

Research_paper_Alexia_Predoi.p... (pdf)

(pdf | 1.04 MB)

License info not available