Evaluating Image2Speech

The evaluation of automatically generated phoneme captions for images

Abstract

Image2Speech is the relatively new task of generating a spoken description of an image. Like Automatic Image Captioning, it is concerned with describing images; however, it avoids the use of textual resources. An Image2Speech system produces a sequence of phonemes instead of (written) words, which makes the task applicable to languages that do not have a standardized writing system. This thesis presents an investigation into the evaluation of the Image2Speech task. The Image2Speech output is evaluated both by human evaluators and with multiple objective evaluation metrics. These metrics, such as BLEU, METEOR, and PER, are commonly used in the field of Natural Language Processing and give an indication of the similarity between two sentences. Since humans are the end users of Image2Speech systems, the objective evaluation metrics are correlated with human judgements in order to determine which metric can best evaluate an Image2Speech system with the end users in mind. To this end, an Image2Speech system was first implemented that generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus, a dataset of 8,000 images in which each image is accompanied by five written and five spoken captions. Subsequently, the phoneme captions were converted into sentences of words so that they could be more easily interpreted by human evaluators. The captions were rated by human evaluators on how well they describe the image, and these ratings were correlated with the objective evaluation metrics. Although BLEU4 does not correlate perfectly with human ratings, it obtained the highest correlation among the investigated metrics and is the best currently existing metric for automatically evaluating the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should instead assume its input to be parts of words, e.g. phonemes.
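The metrics discussed above all compare a generated sequence against one or more reference sequences. As an illustration of how such a metric operates directly on phonemes, PER (phoneme error rate) can be computed as the Levenshtein (edit) distance between the generated and reference phoneme sequences, normalized by the reference length. A minimal sketch in Python; the function name and the ARPAbet-style example sequences are illustrative and not taken from the thesis or the Flickr8k data:

```python
def phoneme_error_rate(reference, hypothesis):
    """PER: edit distance between two phoneme sequences,
    normalized by the length of the reference sequence."""
    # Classic dynamic-programming Levenshtein distance over phoneme tokens.
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions
    for j in range(n + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[m][n] / m

# Illustrative phoneme sequences: the hypothesis drops the final phoneme.
ref = "D AO G Z R AH N".split()
hyp = "D AO G Z R AH".split()
print(phoneme_error_rate(ref, hyp))  # one deletion out of 7 reference phonemes
```

Unlike BLEU or METEOR, which assume word tokens, this kind of edit-distance measure makes no assumption about what a token is, which is why sub-word metrics are a natural direction for evaluating Image2Speech output.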