Correspondence Between Perplexity Scores and Human Evaluation of Generated TV-Show Scripts

Bachelor Thesis (2020)
Author(s)

P. Keukeleire (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Stavros Makrodimitris – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Arman Naseri Naseri – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Tom Julian Viering – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

M. Loog – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

David Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2020 Pia Keukeleire
More Info
Publication Year
2020
Language
English
Graduation Date
22-06-2020
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In recent years many new text generation models have been developed, while evaluation of text generation remains a considerable challenge. Currently, the only metric that is able to fully capture the quality of a generated text is human evaluation, which is expensive and time-consuming. One of the most widely used intrinsic evaluation metrics is perplexity. This paper investigated the correspondence between perplexity scores and human evaluation of scripts for the TV-show Friends generated using OpenAI's GPT-2 model. This was done by conducting a survey in which 226 participants evaluated selected scripts on creativity, realism and coherence. The survey results revealed that generations with a perplexity value close to that of an actual Friends script perform best on creativity, but score low on realism and coherence. The most realistic and coherent generations were those with a lower perplexity value, while the worst in all fields were the generations with the highest perplexity value. The research shows that perplexity is not an adequate measure of the quality of generated TV-show scripts.


