Correspondence Between Perplexity Scores and Human Evaluation of Generated TV-Show Scripts
P. Keukeleire (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Stavros Makrodimitris – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Arman Naseri – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Tom Julian Viering – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
M. Loog – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
David Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
In recent years, many new text generation models have been developed, while the evaluation of generated text remains a considerable challenge. Currently, the only metric able to fully capture the quality of a generated text is human evaluation, which is expensive and time-consuming. One of the most widely used intrinsic evaluation metrics is perplexity. This paper investigated the correspondence between perplexity scores and human evaluation of scripts for the TV show Friends generated with OpenAI's GPT-2 model. A survey was conducted in which 226 participants evaluated selected scripts on creativity, realism and coherence. The survey results revealed that generations with a perplexity value close to that of an actual Friends script performed best on creativity but scored low on realism and coherence. The most realistic and coherent generations were those with a lower perplexity value, while the generations with the highest perplexity value scored worst on all criteria. This research shows that perplexity is not an adequate measure of the quality of generated TV-show scripts.
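The abstract does not detail how perplexity was computed, but the metric itself is simply the exponential of the average negative log-likelihood per token under the language model. The sketch below is a minimal, illustrative way to obtain such a score for a script excerpt with GPT-2 via the Hugging Face transformers library; the function name and example excerpt are hypothetical and not taken from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def script_perplexity(text: str, model_name: str = "gpt2") -> float:
    """Return the perplexity GPT-2 assigns to a piece of script text (illustrative sketch)."""
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    # Score the text against itself: passing the input ids as labels makes the
    # returned loss the average negative log-likelihood per token.
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)

    # Perplexity is the exponential of that average negative log-likelihood.
    return torch.exp(outputs.loss).item()

if __name__ == "__main__":
    excerpt = "Joey: How you doin'?\nChandler: Could I BE any more tired of that line?"
    print(f"Perplexity: {script_perplexity(excerpt):.2f}")
```

Lower values mean the model finds the text more predictable, which is why, as the survey results suggest, low perplexity need not coincide with the creativity that human judges reward.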