Who said that? Comparing performance of TF-IDF and fastText to identify authorship of short sentences

Bachelor Thesis (2020)
Authors

T.A.R. van Tussenbroek (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Supervisors

David Tax (TU Delft - Pattern Recognition and Bioinformatics)

M. Loog (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 Thomas van Tussenbroek
Publication Year
2020
Language
English
Graduation Date
22-06-2020
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Authorship identification is often applied to large documents, but less often to short, everyday sentences. The ability to identify who said a short line could help chatbots or personal assistants. This research compares the performance of TF-IDF and fastText in identifying the authorship of short sentences by applying these feature extraction techniques to transcripts of the television series Friends. TF-IDF outperforms fastText in every measurement, but its performance is only marginally better than randomly guessing the original character, reaching an accuracy of 28 percent when distinguishing between 6 characters. For both techniques, accuracy increases linearly at the same rate as the minimum word count per sentence imposed on the test data increases. TF-IDF's confidence remains constant when this limit is imposed on either the test or the training data, whereas fastText's confidence decreases when the limit is imposed on the test data and increases when it is imposed on the training data. Cross-entropy loss, however, remains constant for fastText and decreases for TF-IDF as the minimum word count imposed on the test data increases.
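To make the evaluation setup concrete, the following is a minimal Python sketch of the TF-IDF side of such an experiment: TF-IDF features, a minimum word count filter on the test data, and accuracy and cross-entropy loss as measurements. It is an illustration under stated assumptions, not the thesis's implementation. The abstract does not name the classifier, so a nearest-centroid classifier with a softmax over cosine similarities stands in for it; the speakers and sentences are invented toy data, not Friends transcripts; all function names are hypothetical; and fastText is left out entirely. For reference, if all 6 characters were equally likely, uniform guessing would score 1/6, or about 17 percent.

```python
import math
from collections import Counter, defaultdict

def tokenize(sentence):
    return sentence.lower().split()

def filter_min_words(samples, min_words):
    """Keep only (speaker, sentence) pairs with at least min_words tokens."""
    return [(s, t) for s, t in samples if len(tokenize(t)) >= min_words]

def fit_idf(sentences):
    """Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1."""
    n = len(sentences)
    df = Counter()
    for text in sentences:
        df.update(set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}

def tfidf_vector(sentence, idf):
    """L2-normalised sparse TF-IDF vector; unseen terms are ignored."""
    counts = Counter(t for t in tokenize(sentence) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def fit_centroids(train, idf):
    """Average TF-IDF vector per speaker (assumed stand-in classifier)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for speaker, text in train:
        counts[speaker] += 1
        for t, v in tfidf_vector(text, idf).items():
            sums[speaker][t] += v
    return {sp: {t: v / counts[sp] for t, v in vec.items()}
            for sp, vec in sums.items()}

def predict_proba(sentence, idf, centroids):
    """Softmax over dot products with each speaker centroid."""
    vec = tfidf_vector(sentence, idf)
    scores = {sp: sum(v * c.get(t, 0.0) for t, v in vec.items())
              for sp, c in centroids.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {sp: math.exp(s) / z for sp, s in scores.items()}

def evaluate(test, idf, centroids):
    """Return (accuracy, mean cross-entropy loss); speakers must be in training."""
    correct, loss = 0, 0.0
    for speaker, text in test:
        probs = predict_proba(text, idf, centroids)
        correct += max(probs, key=probs.get) == speaker
        loss -= math.log(probs[speaker])
    return correct / len(test), loss / len(test)

# Invented toy data with hypothetical speakers.
train = [
    ("speaker_a", "i love strong coffee in the morning"),
    ("speaker_a", "coffee is all i need"),
    ("speaker_b", "dinosaurs are my favourite topic"),
    ("speaker_b", "i study dinosaurs at the museum"),
    ("speaker_a", "ok"),
]
test = [
    ("speaker_a", "more coffee please"),
    ("speaker_b", "the museum has dinosaurs"),
    ("speaker_b", "hi"),
]

idf = fit_idf([text for _, text in train])
centroids = fit_centroids(train, idf)

# Evaluate with and without a minimum word count on the test data.
acc_all, loss_all = evaluate(filter_min_words(test, 1), idf, centroids)
acc_long, loss_long = evaluate(filter_min_words(test, 2), idf, centroids)
print(f"min 1 word:  accuracy={acc_all:.2f}  loss={loss_all:.3f}")
print(f"min 2 words: accuracy={acc_long:.2f}  loss={loss_long:.3f}")
```

In this toy setting, the one-word test sentence shares no vocabulary with the training data, so the model can only guess; raising the minimum word count removes it from the test set. This mirrors, in a very small way, why performance can depend on sentence length, but it says nothing about the magnitudes reported in the thesis.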
