Evaluating automatically generated phoneme captions for images

Conference Paper (2020)
Author(s)

Justin van der Hout (Student TU Delft)

Zoltán D’Haese (Katholieke Universiteit Leuven)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

O.E. Scharenborg (TU Delft - Multimedia Computing)

Multimedia Computing
Copyright
© 2020 Justin van der Hout, Zoltán D’Haese, Mark Hasegawa-Johnson, O.E. Scharenborg
DOI
https://doi.org/10.21437/Interspeech.2020-2870
Publication Year
2020
Language
English
Pages (from-to)
2317 - 2321
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. First, an Image2Speech system was implemented which generates image captions consisting of phoneme sequences; this system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words, and the captions were rated by human evaluators for how well they described the image. Finally, scores from several objective metrics were correlated with these human ratings. Although BLEU4 does not correlate perfectly with human ratings, it obtained the highest correlation among the investigated metrics and is currently the best existing metric for the Image2Speech task. Current metrics are limited by the assumption that their input consists of words; a more appropriate metric for the Image2Speech task should instead take parts of words, i.e. phonemes, as input.
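As an illustration of the kind of evaluation the abstract describes, the sketch below scores a phoneme-sequence caption with BLEU-4 (treating each phoneme as a token) and correlates metric scores with human ratings using a rank correlation. The ARPAbet phoneme sequences, metric scores, and ratings are invented for illustration and are not taken from the paper; the paper does not specify these exact tools or values.

```python
# Hypothetical sketch: phoneme-level BLEU-4 and metric/human-rating correlation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Captions are token sequences; here the tokens are phonemes rather than words.
reference = ["DH", "AH", "D", "AO", "G", "R", "AH", "N", "Z",
             "IH", "N", "DH", "AH", "G", "R", "AE", "S"]
hypothesis = ["AH", "D", "AO", "G", "IH", "Z", "R", "AH", "N", "IH", "NG"]

# BLEU-4: geometric mean of 1- to 4-gram precisions with a brevity penalty,
# computed here over phoneme n-grams instead of word n-grams.
bleu4 = sentence_bleu([reference], hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"phoneme-level BLEU-4: {bleu4:.3f}")

# Given per-caption metric scores and human ratings (illustrative values only),
# a rank correlation indicates how well the metric tracks human judgement.
metric_scores = [0.42, 0.18, 0.55, 0.31]
human_ratings = [3.8, 2.1, 4.5, 3.0]
rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```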
