C. Hao | TU Delft Repository

Analysis of results in the ML research field

Investigating the Efficacy of LLMs in Extracting Stated Research Limitations

Bachelor thesis (2026) - A.E. Predoi, D.M.J. Tax, C. Hao, H.S. Hung, N. Tömen, K.A. Hildebrandt

The rapid growth of Machine Learning research has overwhelmed traditional peer-review systems, leading to concerns regarding reviewer fatigue and the consistency of scientific evaluation. While Large Language Models (LLMs) are being explored as potential assistants for quality assessment, their ability to objectively verify specific scientific criteria, such as those in the NeurIPS Paper Checklist, remains unproven. This checklist serves as a structured self-auditing framework that requires authors to explicitly declare critical details, including potential negative societal impacts, exact hyperparameter tuning ranges, and clear definitions of theoretical assumptions or limitations. This study investigates the core question: “How well can an LLM extract the limitations described in scientific papers?” Using a manually annotated dataset of 78 papers, this research evaluates the accuracy of LLMs in extracting limitations stated by authors. Our findings reveal that while the LLM achieves perfect accuracy (100%) in detecting the presence of dedicated limitation sections, its performance in textual extraction is more nuanced. For explicit limitations, the model demonstrates high recall (0.91) but moderate precision (0.71), frequently over-extracting context. Furthermore, when tasked with extracting implicit limitations from papers lacking dedicated sections, both recall (0.71) and precision (0.69) decline. Notably, a major bottleneck in unstructured text is getting the LLM to focus on the actual weakness rather than being distracted by subsequent sentences about future work. By comparing LLM performance against a human-verified ground truth, this work provides a feasibility study for automating high-stakes research quality assessments and identifies current bottlenecks in LLM reasoning for scientific auditing. ...
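To make the recall and precision figures in this abstract concrete, here is a minimal Python sketch of how the two metrics behave when an extractor returns extra context. It assumes limitations are matched as exact sentence identifiers; the abstract does not state how the thesis matches extracted text to annotations, and the function name and sample data are hypothetical.

```python
def precision_recall(extracted, gold):
    """Set-based precision and recall for extracted items.

    Assumption: items match only when their identifiers are equal.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are real limitations.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of annotated limitations that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical paper: three annotated limitation sentences.
gold = {"s1", "s2", "s3"}
# The extractor finds all three but also returns a future-work sentence.
extracted = {"s1", "s2", "s3", "s4"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every annotated limitation is found, so recall stays at 1.0, while the extra sentence lowers precision to 0.75. That is the same pattern the abstract describes as over-extracting context.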

Evaluating the Ability of Large Language Models to Classify Scientific Papers as Empirical or Theoretical using the NeurIPS Checklist

Bachelor thesis (2026) - A. Wielinga, D.M.J. Tax, H.S. Hung, N. Tömen, C. Hao, K.A. Hildebrandt

As machine learning conferences such as NeurIPS expand rapidly, the manual classification and evaluation of responsible research checklists impose a significant burden on reviewers. This study investigates the ability of Large Language Models (LLMs) to automatically classify research papers as empirical, theoretical, or hybrid, and to extract checklist compliance data. Using a dataset of publicly available NeurIPS papers, we designed an automated pipeline and evaluated its outputs against a human-annotated ground truth. Our results demonstrate that the LLM achieves high accuracy in the core classification task, reliably distinguishing each paper's core methodology by identifying clear structural indicators like mathematical proofs and benchmark datasets. Furthermore, the model excels at extracting objective checklist elements, performing well on close-ended extraction tasks that rely on clear structural indicators. However, performance noticeably decreased on structurally scattered or subjective criteria, such as broader impacts and the declaration of AI usage. This drop highlights a limitation in the model’s broader reading comprehension, as it struggles to merge contextual information without explicit headers. Notably, this automated failure closely mirrors human task ambiguity, as these exact subjective items also generated lower inter-annotator agreement among human annotators. In conclusion, while LLMs provide a highly consistent baseline for classifying paper typologies and extracting explicit methodological data, their reliance on structural cues indicates they should serve as assistive screening tools rather than autonomous evaluators in academic peer review. ...

Analysis of Results in the ML Research Field

How well can an LLM decide the reproducibility of a paper?

Bachelor thesis (2026) - A.A. Opritoiu, D.M.J. Tax, C. Hao, H.S. Hung, N. Tömen, K.A. Hildebrandt

The recent surge in machine learning (ML) research has led to a record number of paper submissions, overwhelming the traditional peer-review process. Although conferences like NeurIPS have introduced reproducibility checklists to maintain scientific standards, manual verification of these claims is time-consuming and inconsistent. This study investigates the feasibility of using Large Language Models (LLMs) to automate the evaluation of paper reproducibility. By creating a ground-truth dataset through the manual annotation of NeurIPS papers, this study assesses the accuracy of LLMs in verifying author claims regarding code availability, hyperparameter transparency, and compute resources. The results compare LLM performance with manual labels to identify where automated tools succeed and where they fail to capture technical nuances. Ultimately, this research demonstrates that while LLMs can act as highly efficient administrative filters to streamline initial screening, they fail to reliably predict execution viability, highlighting the remaining boundaries of automated verification. ...

Large Language Models for Reviewing Research Papers

Evaluating Claim-Level Completeness in Machine Learning Research

Bachelor thesis (2026) - S.I. Simeonova, D.M.J. Tax, C. Hao, N. Tömen, H.S. Hung, K.A. Hildebrandt

Scientific peer review is an important part of the scientific process. However, the growing number of submissions has sparked interest in automated review tools. Recent work has shown that Large Language Models (LLMs) can generate reviews and evaluate author-provided checklists, yet it is unclear to what extent they can independently identify the scientific claims that are made in papers and perform structured reviews. This thesis investigates whether an LLM can automatically extract scientific claims from research papers in the machine learning field and then complete the NeurIPS Checklist without relying on author-written justifications. The evaluation focuses on claim extraction accuracy, preserving the semantic meaning of claims, and agreement between LLM-generated checklist annotations and human judgment. Gemini 3 Flash's claim extraction and checklist annotations are compared against human ground-truth annotations on NeurIPS 2024 papers. The results show that the model successfully identifies primary claims of papers, with a recall of 0.99 and precision of 0.75. Most errors are caused by over-segmentation or incorrect classification. For checklist annotation, the system achieves a mean accuracy of 0.85 and a mean Cohen's Kappa of 0.58 compared to human annotations. Agreement is strongest for objective checklist criteria. These findings indicate that LLMs can effectively support claim-based scientific review, but are not advanced enough to fully replace expert reviewers. ...
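The abstract reports both accuracy and Cohen's Kappa for checklist annotation. The sketch below shows why the two differ: Kappa discounts the agreement two annotators would reach by chance. It is a plain-Python illustration of the standard two-rater formula, not the thesis's evaluation code, and the labels are invented toy data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: a human and a model answering one checklist item for 8 papers.
human = ["yes", "yes", "yes", "yes", "no", "no", "no", "no"]
model = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

accuracy = sum(h == m for h, m in zip(human, model)) / len(human)
print(accuracy, cohens_kappa(human, model))
```

Here the raw agreement is 0.75, but half of that is expected by chance with balanced labels, so Kappa comes out at 0.5. A gap of this kind between accuracy and Kappa is normal and does not by itself indicate an error.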

ACT-R in the military

A systematic review of Adaptive Control of Thought - Rational, a cognitive architecture in the military

Bachelor thesis (2025) - V.N. Loykens, C. Hao, B.J.W. Dudzik

This paper provides an overview of the use of ACT-R as a cognitive architecture in the military. ACT-R stands for Adaptive Control of Thought - Rational. It is a cognitive architecture, a framework for a human-like AI program, that models the human mind. This paper examines its use in the military. Through this literature survey, an overview is created of the military’s usage of ACT-R. The overview answers the questions of in which applications the military uses ACT-R and why. It gives the public an understanding of how ACT-R is used in the military and insight into how their tax money is being spent. For the military, an overview will come in handy in case ACT-R becomes outdated: they will know which programs need an update. The overview consists of three parts: a robotics operator manager, which is a test to determine the value of an officer managing multiple robots; the creation of intelligent tutoring systems for ship navigation and aircraft recognition; and a supporting tool for analysts to help determine the value of information. ...

How suited are cognitive architectures for implementing moral reasoning? – a Systematic Literature Review

Bachelor thesis (2025) - W.S. Hajdas, B.J.W. Dudzik, C. Hao, C.R.M.M. Oertel Genannt Bierbach

This paper surveys nine studies that implement aspects of moral reasoning within cognitive architectures (CAs) or CA-inspired frameworks. Its primary aim is to assess the viability of this approach for future research and to clarify the state of the domain. Two research paradigms emerge: (1) modeling human moral reasoning and (2) constructing artificial moral agents. Despite this distinction, all studies face similar challenges: fragmented reuse (each employs a different architecture), limited pre-programmed behaviors, and the absence of standardized benchmarks or metrics. Researchers remain optimistic about the explainability of their systems' behaviors and inner workings, yet often they acknowledge significant scalability and validation hurdles. Overall, CAs currently support only small-scale experiments; substantial further research – both empirical and into the theoretical basis of the field – is needed before these systems can attain real-world relevance. ...

Modeling Episodic Memory in Cognitive Architectures

A Comparative Study of Soar and Xapagy

Bachelor thesis (2025) - H. Xie, C. Hao, B.J.W. Dudzik, C.R.M.M. Oertel Genannt Bierbach

Episodic memory (EM) -- the capacity to recall past experiences situated in time and context -- is a critical component of intelligent behavior. Although several cognitive architectures (CAs) have incorporated mechanisms inspired by episodic memory, implementations vary widely in structure, mechanisms, and integration with other cognitive functions. While prior work has reviewed episodic memory across a range of architectures in a high-level manner, detailed, structured comparisons among specific systems remain lacking. This study presents a focused comparative analysis of modeling episodic memory in two contrasting cognitive architectures: Soar, a symbolic, rule-based, general-purpose system, and Xapagy, a system designed specifically for narrative reasoning, relying on direct, unprocessed recordings of autobiographical events. By analyzing the representations, structures, and mechanisms of episodic memory in these two systems, this study highlights important design trade-offs and distinct assumptions about the role of episodic memory in cognition and its modeling approaches in CAs. ...

An Analysis of ACT-R and CLARION Representing Heuristic Strategies for Consumer Decision-Making

A Systematic Literature Review

Bachelor thesis (2025) - W.J.P.L. van de Sanden, B.J.W. Dudzik, C. Hao, C.R.M.M. Oertel Genannt Bierbach

Heuristic strategies are an integral part of consumer decision-making. Heuristics serve as mental shortcuts that reduce cognitive effort, simplifying consumer decisions. To go from qualitative insights into these heuristics to quantitative data, a cognitive architecture must represent these heuristic strategies to understand consumer behavior better. This study focuses on the cognitive architectures ACT-R and CLARION since there is an interesting distinction in how they structure symbolic (explicit) and subsymbolic (implicit) cognition, influencing how they represent heuristic behavior. Currently, there is no systematic overview and comparison of how ACT-R and CLARION represent heuristics relevant to consumer decision-making. This paper aims to fill this knowledge gap by performing a systematic literature review on papers that contain an ACT-R or CLARION representation of heuristics relevant to consumer decision-making. The review uses four databases for the literature search: Scopus, Web of Science, IEEE Xplore, and ACM Digital Library. In total, 58 records have been screened, and 12 records have been included in the review. The review shows that ACT-R’s strength lies in representing heuristics by sequentially executing rule-based heuristics, while CLARION focuses on representing similarity-based heuristics by using bottom-up activation from its implicit layers. The results show a pattern in which the architectural structure mainly determines which heuristic strategies have been represented. ...

Human-like AI in Strategy Games: Guided by Playstyle Profiling and Player Perception

Master thesis (2025) - T.E. van Ham, A. Zgonnikov, C. Hao, Maxim Mozgovoy, J.M. Prendergast

Strategy games provide a compelling testbed for developing human-like computer agents, with applications that extend beyond gaming into fields requiring adaptive and socially intelligent AI. In these games, players tend to enjoy and engage more deeply with AI opponents that not only provide a challenge but also behave in ways that resemble human thinking and decision-making. However, despite progress in developing such agents, there is still no standard approach for evaluating how human-like these opponents truly are—making it difficult to assess and improve their design. Here I show that strategy game opponents having more human-like game-level playstyles does not necessarily lead to them being more believable (perceived as human-like by human players).

By developing a turn-based strategy game and evaluating Hierarchical Reinforcement Learning (HRL) agents of varying complexity, I assessed both their behavioural similarity to human players and how believable they were perceived to be by human players. This research introduces a new approach for understanding player behaviour using behaviour vectors composed of three high-level metrics—Aggressiveness, Management, and Exploration—consistent with existing literature. These metrics are designed to be broadly applicable across strategy games, enabling consistent comparison between human and AI opponents, as well as across different games and agents. The findings demonstrate that while HRL agents can replicate human-like playstyles without using human training data, players judge human-likeness more on perceived intelligence and fairness. This suggests that creating truly human-like AI opponents requires not just replicating human game-level playstyles, but designing agents that align with players' expectations for intelligent and fair decision-making. ...

What if fanfiction, but also coding: Investigating cultural differences in fanfiction writing and reviewing with machine learning methods

How has the portrayal of female characters in fanfiction evolved in response to the #MeToo movement and fourth-wave feminism, as analyzed with the help of NLP techniques?

Bachelor thesis (2025) - I. Marinescu, H.S. Hung, E. Eisemann, C. Hao, I. Kondyurin

This paper explores how the portrayal of female characters in fanfiction evolved in response to the #MeToo movement and fourth-wave feminism, with the aim of assessing whether the impact of the awareness of the campaign was broad enough to visibly alter how the average author portrays women in narrative contexts. To analyze these trends, fanfiction data from Archive of Our Own (AO3) spanning 2015–2019 was parsed, and two Natural Language Processing (NLP) pipelines — Word2Vec and GloVe, and BERT — were developed. The study finds that bias scores, aggregated through formulas created to compare gendered associations, show a stronger stereotypization of women before 2017 compared to after. Furthermore, a similar trend is discovered in the representation of women in fanfiction. While the BERT pipeline proved most effective for capturing contextual nuances, it is significantly limited by its reliance on binary labels and computational intensity. This further indicates the need for more inclusive and sustainable methods, making the Word2Vec/GloVe models more appropriate for this task. The paper concludes with recommendations for future work, including broader representation, longer-term analysis, and enhanced detection of evolving language patterns. ...

The impact of emotional journeys on fanfiction popularity

A computational analysis of linear correlations between emotional behavior and popularity

Bachelor thesis (2025) - J. van der Weijden, C. Hao, I. Kondyurin, H.S. Hung, E. Eisemann

Fanfiction writers always look for ways to make their stories more engaging. Analyzing what influences the popularity of fanfiction provides insights into readers' preferences and allows writers to tailor to these. This paper attempts to find linear correlations between fanfiction stories and the emotional journey of their characters. It does so by computationally extracting these journeys from 319 Good Omens fanfiction stories, defining and extracting several features from them and using simple linear regression to determine their correlation to fanfiction popularity. Five features were found to have a significant influence on fanfiction popularity. It was also determined that readers prefer characters that have low emotional fluctuations in their behavior. ...

Exploring Genre Preferences and Audience Engagement in Multilingual Fanfiction

A Study of Popularity and Preferences

Bachelor thesis (2025) - J.Q.Q. Ye, E. Eisemann, H.S. Hung, C. Hao, I. Kondyurin

This study investigates how genre preferences and sentiment influence fanfiction popularity across multiple languages, focusing on English, Mandarin, Russian, and Spanish datasets. Leveraging advanced natural language processing techniques, including multilingual sentiment analysis, genre classification, and topic modeling, this research explores the interplay between cultural and linguistic factors in storytelling. Preprocessing steps, such as translation and named entity recognition, ensured consistency and reduced noise across the multilingual dataset. Key findings reveal cross-linguistic patterns, such as the popularity of genres like Alternate Universe and Romance, alongside cultural distinctions in sentiment and engagement. This work contributes to computational fan studies by demonstrating how linguistic and cultural factors influence storytelling trends and audience preferences in fanfiction. ...

What if fanfiction, but also coding: Investigating cultural differences in fanfiction writing and reviewing with machine learning methods

Fine Tuning a BERT-based Pre-Trained Language Model for Named Entity Extraction within the Domain of Fanfiction

Bachelor thesis (2025) - N.P.A. Kindt, H.S. Hung, C. Hao, I. Kondyurin, E. Eisemann

The introduction of Pretrained Language Models (PLMs) has revolutionised the field of Natural Language Processing (NLP) and paved the way for many new, exciting large-scale studies for various areas of research. One such field presents itself in the emerging digital literary corpus that is fanfiction, providing research opportunities within the fields of NLP, Computational (Socio-) Linguistics, the Social Sciences and Digital Humanities. However, because of the unique linguistic characteristics of this literary domain, many modern NLP solutions utilizing PLMs encounter difficulties when applied to fanfiction texts. This paper aims to indicate that the performance of various NLP tasks performed by PLMs on fanfiction texts can be improved by applying Domain Adaptive Pre-Training (DAPT) to PLMs. A case study is performed to show that the performance of a BERT-based PLM can be improved for the downstream NLP task of Named Entity Recognition (NER) by applying supervised domain-specific fine-tuning. While we gain a 6% increase in F1 score, we are sceptical about these results: the limited amount of annotated data available leads the model to overfit, and it shows a lack of capacity to generalize to unseen data from the CoNLL NER dataset. ...

Visualizing Collaboration with Superstars

A Novel Approach to Visualizing Collaboration

Bachelor thesis (2024) - P.S. Hull, H.S. Hung, V. Agarwal, C. Hao

Superstar researchers - those who author research papers which are far more widely cited than average - are generally well-respected within their fields and are frequently sought by new researchers for advice on career development and for collaborations. Though the effect of collaboration with superstar researchers on associated researchers is not well-researched yet, it is believed that collaboration with superstar researchers increases the research output of associated researchers, but may decrease their originality and innovation. This project explores a method of visualizing the career development of researchers associated with superstars. Using a dataset of authors, papers, and paper citations, a graph has been created with authors as nodes and papers as edges to visualize the career development of associated authors after collaboration with a superstar. The utility of this visualization is evaluated using heuristics. ...

Independent Thinkers and Scientific Progress

An Analysis of Superstar Influence on Computer Science Research Dynamics

Bachelor thesis (2024) - F.J. Płonka, H.S. Hung, C. Hao, V. Agarwal, M. Khosla

In the scientific community, a few prominent researchers, known as "superstars," receive most of the attention, citations, and resources. However, it is unclear whether they promote true innovation. This study replicates and extends previous work analyzing how superstars influence their collaborators, focusing on the field of computer science. Using the Semantic Scholar Academic Graph dataset, we confirm that while connected researchers in computer science tend to publish more and receive more citations, their ideas are often less innovative. Unlike in the work we replicate, however, we observe this effect even before dissociating the connected researchers from superstars. We also develop a new metric to further clarify the impact of superstars on research diversity. The findings provide insights into the role of superstars in scientific innovation. ...

The dissociation of researchers from superstars through a new metric

Bachelor thesis (2024) - F.T. Marchidan, H.S. Hung, C. Hao, V. Agarwal

This study introduces a new metric for evaluating the dissociation between superstar and non-superstar researchers. Superstar researchers are defined as those in the top 0.1% by h-index. Leveraging a large dataset, this paper analyzes the data and aims to flatten the discrepancy between superstars and non-superstars in terms of innovation and popularity. Some authors who publish innovative papers and who haven't collaborated with superstars tend to be left in the shadows compared to the ones that have collaborated with superstars from an early stage. The new metric indicates the dissociation between such authors by factoring in certain parameters that were put into perspective with the help of a Multiple Linear Regression model. The findings reveal significant differences in dissociation scores between researchers and superstar researchers, offering new insights into the dynamics of academic innovation and collaboration. This metric provides a robust tool to identify where an author stands in terms of dissociation and what needs to be done to diminish the discrepancy. ...
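The superstar definition in this abstract (top 0.1% by h-index) can be illustrated with a short Python sketch. The h-index computation follows the standard definition. The selection function, its tie handling, and the choice to always return at least one author are assumptions made for illustration, not details taken from the thesis.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def select_superstars(h_by_author, top_fraction=0.001):
    """Return the authors in the top `top_fraction` by h-index.

    Assumptions: ties at the cutoff are broken by sort order, and at
    least one author is always returned for small datasets.
    """
    n_top = max(1, int(len(h_by_author) * top_fraction))
    ranked = sorted(h_by_author, key=h_by_author.get, reverse=True)
    return set(ranked[:n_top])


# Hypothetical citation counts per paper for three authors.
papers = {
    "author_a": [10, 8, 5, 4, 3],
    "author_b": [2, 1],
    "author_c": [6, 6, 6],
}
h_by_author = {name: h_index(cites) for name, cites in papers.items()}
print(h_by_author)
print(select_superstars(h_by_author))
```

With only three authors the 0.1% cutoff rounds down to zero, so the sketch falls back to the single highest-ranked author. On a dataset of a million authors the same call would return the top 1,000.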