Correspondence Between Perplexity Scores and Human Evaluation of Generated TV-Show Scripts
P. Keukeleire (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Stavros Makrodimitris – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Arman Naseri – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Tom Julian Viering – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
M. Loog – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
David Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
In recent years, many new text generation models have been developed, while the evaluation of generated text remains a considerable challenge. Currently, the only metric able to fully capture the quality of a generated text is human evaluation, which is expensive and time-consuming. One of the most widely used intrinsic evaluation metrics is perplexity. This paper investigated the correspondence between perplexity scores and human evaluation of scripts for the TV show Friends generated with OpenAI's GPT-2 model. A survey was conducted in which 226 participants evaluated selected scripts on creativity, realism and coherence. The survey results revealed that generations with a perplexity value close to that of an actual Friends script performed best on creativity but scored low on realism and coherence. The most realistic and coherent generations were those with a lower perplexity value, while the generations with the highest perplexity value scored worst on all criteria. This research shows that perplexity is not an adequate measure of the quality of generated TV-show scripts.
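The abstract does not detail how perplexity was computed, but the metric itself is simply the exponential of the average negative log-likelihood per token under the language model. The sketch below is a minimal, illustrative way to obtain such a score for a script excerpt with GPT-2 via the Hugging Face transformers library; the function name and example excerpt are hypothetical and not taken from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def script_perplexity(text: str, model_name: str = "gpt2") -> float:
    """Return the perplexity GPT-2 assigns to a piece of script text (illustrative sketch)."""
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    # Score the text against itself: passing the input ids as labels makes the
    # returned loss the average negative log-likelihood per token.
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)

    # Perplexity is the exponential of that average negative log-likelihood.
    return torch.exp(outputs.loss).item()

if __name__ == "__main__":
    excerpt = "Joey: How you doin'?\nChandler: Could I BE any more tired of that line?"
    print(f"Perplexity: {script_perplexity(excerpt):.2f}")
```

Lower values mean the model finds the text more predictable, which is why, as the survey results suggest, low perplexity need not coincide with the creativity that human judges reward.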