Use of LLMs to Improve Affiliation Disambiguation in Alexandria3k

Bachelor Thesis (2024)
Author(s)

D.T. Gupta (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Diomidis Spinellis – Mentor (TU Delft - Software Engineering)

G. Gousios – Mentor (TU Delft - Software Technology)

Koen Langendoen – Graduation committee member (TU Delft - Embedded Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2024 Dibyendu Gupta
Publication Year
2024
Language
English
Graduation Date
01-02-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The growth of academic publications, the heterogeneity of datasets, and the absence of a globally accepted organization identifier make affiliation disambiguation a challenge in bibliographic databases. In this paper, we establish a baseline by comparing the algorithm currently implemented in Alexandria3k for author affiliation linkage against a ground truth. We then explore the use of LLMs (GPT-4) within the Alexandria3k environment to disambiguate author affiliations. The proposed approach extracts the research organization from the textual affiliations that researchers provide in their published works and cross-references it against the Research Organization Registry. Our process shows promising results and a significant improvement over the existing algorithm in terms of matching rate and identification of multiple affiliations. We discuss the margin of error in LLM results and the limitations of the ground truth, and suggest directions for future research.
