What if fanfiction, but also coding: Investigating cultural differences in fanfiction writing and reviewing with machine learning methods

Fine Tuning a BERT-based Pre-Trained Language Model for Named Entity Extraction within the Domain of Fanfiction

Bachelor Thesis (2025)
Author(s)

N.P.A. Kindt (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H.S. Hung – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

C. Hao – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

I. Kondyurin – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

E. Eisemann – Graduation committee member (TU Delft - Computer Graphics and Visualisation)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
07-02-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The introduction of Pre-trained Language Models (PLMs) has revolutionised the field of Natural Language Processing (NLP) and paved the way for new large-scale studies across many areas of research. One such area is the emerging digital literary corpus of fanfiction, which offers research opportunities in NLP, computational (socio-)linguistics, the social sciences and the digital humanities. However, because of the unique linguistic characteristics of this literary domain, many modern NLP solutions built on PLMs encounter difficulties when applied to fanfiction texts. This paper aims to show that the performance of PLMs on various NLP tasks over fanfiction texts can be improved by applying Domain Adaptive Pre-Training (DAPT). A case study demonstrates that the performance of a BERT-based PLM on the downstream NLP task of Named Entity Recognition (NER) can be improved through supervised domain-specific fine-tuning. While we obtain a 6% increase in F1 score, we remain sceptical of this result: the limited amount of annotated data available led the model to overfit and to show a lack of capacity to generalise to unseen data from the CoNLL NER dataset.
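
For illustration, the supervised fine-tuning step described in the abstract could be set up roughly as follows with the Hugging Face transformers library. This is a minimal sketch under stated assumptions, not the thesis's actual pipeline: the base checkpoint, label set, hyperparameters and the dataset names train_split / dev_split are placeholders standing in for the annotated fanfiction data.

# Minimal sketch of supervised fine-tuning of a BERT-based model for NER.
# The checkpoint, label set, hyperparameters and dataset names are assumptions.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

MODEL_NAME = "bert-base-cased"                       # assumed base checkpoint
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # illustrative label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(LABELS),
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

def tokenize_and_align(example):
    """Tokenize pre-split words and align word-level NER tags to sub-word tokens."""
    encoded = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels = []
    for word_id in encoded.word_ids():
        # Special tokens get -100 so they are ignored by the loss;
        # sub-word pieces inherit the label of their source word.
        labels.append(example["ner_tags"][word_id] if word_id is not None else -100)
    encoded["labels"] = labels
    return encoded

# `train_split` and `dev_split` stand in for the annotated fanfiction data:
# train_dataset = train_split.map(tokenize_and_align)
# eval_dataset = dev_split.map(tokenize_and_align)

training_args = TrainingArguments(
    output_dir="fanfic-ner",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
#     data_collator=DataCollatorForTokenClassification(tokenizer),
# )
# trainer.train()

With so small an annotated corpus, a setup like this can reach a high F1 score on its own dev split while still overfitting, which is consistent with the scepticism expressed in the abstract about generalisation to the CoNLL NER data.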

Files

License info not available