Who said that? Comparing performance of TF-IDF and fastText to identify authorship of short sentences

Bachelor Thesis (2020)

Author(s)

T.A.R. van Tussenbroek (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Tom Julian Viering – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Stavros Makrodimitris – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Arman Naseri Naseri – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

David Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M. Loog – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Natural Language Processing, TF-IDF, Authorship Identification, FastText, Short sentences

To reference this document use:

https://resolver.tudelft.nl/uuid:93873bbf-2886-4023-b696-e11be2b99024

More Info

Publication Year

2020

Language

English

Graduation Date

22-06-2020

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Authorship identification is often applied to large documents, but less so to short, everyday sentences. The ability to identify who said a short line could help chatbots or personal assistants. This research compares the performance of TF-IDF and fastText in identifying the authorship of short sentences by applying these feature extraction techniques to transcripts of the television series Friends. TF-IDF outperforms fastText in every measurement, but its performance is only marginally better than randomly guessing the original character: it reaches an accuracy of 28 percent when distinguishing between 6 characters. For both techniques, accuracy increases linearly and at the same rate as the minimum word count per sentence imposed on the test data increases. TF-IDF's confidence remains constant when this limit is imposed on either the test or the training data, whereas fastText's confidence decreases and increases, respectively. Cross-entropy loss, however, remains constant for fastText and decreases for TF-IDF as the minimum word count imposed on the test data increases.
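The evaluation idea in the abstract (TF-IDF features, a minimum word count applied to the test sentences, and accuracy alongside cross-entropy loss) can be illustrated with a small pure-Python sketch. This is not the thesis's pipeline: the abstract does not name the classifier, so a nearest-centroid classifier with a softmax over similarities is used here purely as a stand-in. The sentences are invented for illustration and are not taken from the Friends transcripts, and all function names (`fit_idf`, `tfidf`, `fit_centroids`, `predict_proba`, `evaluate`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(sentence):
    return sentence.lower().split()

def fit_idf(sentences):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(sentence, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokenize(sentence) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def fit_centroids(sentences, labels, idf):
    """Mean TF-IDF vector per speaker (stand-in classifier, an assumption)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for s, y in zip(sentences, labels):
        for t, v in tfidf(s, idf).items():
            sums[y][t] += v
    return {y: {t: v / counts[y] for t, v in vec.items()} for y, vec in sums.items()}

def predict_proba(sentence, idf, centroids):
    """Softmax over the dot product between the sentence and each centroid."""
    vec = tfidf(sentence, idf)
    scores = {y: sum(v * c.get(t, 0.0) for t, v in vec.items())
              for y, c in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def evaluate(sentences, labels, idf, centroids, min_words=1):
    """Accuracy and mean cross-entropy on test sentences with >= min_words words."""
    kept = [(s, y) for s, y in zip(sentences, labels)
            if len(tokenize(s)) >= min_words]
    if not kept:
        raise ValueError("no test sentence meets the minimum word count")
    correct, loss = 0, 0.0
    for s, y in kept:
        probs = predict_proba(s, idf, centroids)
        correct += max(probs, key=probs.get) == y
        loss -= math.log(probs[y])  # assumes y was seen during training
    return correct / len(kept), loss / len(kept), len(kept)

# Invented toy data with two hypothetical speakers.
train_sentences = [
    "the coffee here is always cold",
    "this coffee is cold again",
    "my guitar needs new strings",
    "i wrote a song on my guitar",
]
train_labels = ["speaker_a", "speaker_a", "speaker_b", "speaker_b"]
test_sentences = ["why is the coffee cold", "a new song for my guitar", "hello"]
test_labels = ["speaker_a", "speaker_b", "speaker_b"]

idf = fit_idf(train_sentences)
centroids = fit_centroids(train_sentences, train_labels, idf)

acc_all, loss_all, n_all = evaluate(test_sentences, test_labels, idf, centroids)
acc_long, loss_long, n_long = evaluate(
    test_sentences, test_labels, idf, centroids, min_words=3
)
print(f"min_words=1: n={n_all} accuracy={acc_all:.2f} loss={loss_all:.3f}")
print(f"min_words=3: n={n_long} accuracy={acc_long:.2f} loss={loss_long:.3f}")
```

Raising `min_words` drops the one-word test sentence, which contains no known vocabulary and therefore receives a uniform prediction. The toy setup only shows the mechanics of the filter and the two metrics; it says nothing about the magnitudes reported in the thesis.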

Files

Research_Paper_Thomas_van_Tuss... (pdf)

(pdf | 0.885 Mb)

License info not available