Synthesizing Spoken Descriptions of Images

Journal Article (2021)
Author(s)

Xinsheng Wang (TU Delft - Multimedia Computing, Xi’an Jiaotong University)

Justin van der Hout (Student TU Delft)

Jihua Zhu (Xi’an Jiaotong University)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

O.E. Scharenborg (TU Delft - Multimedia Computing)

Multimedia Computing
Copyright
© 2021 X. Wang, Justin van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, O.E. Scharenborg
DOI
https://doi.org/10.1109/TASLP.2021.3120644
Publication Year
2021
Language
English
Volume number
29
Pages (from-to)
3242-3254
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to roughly half of the world's languages, because these languages lack a written form. To address this problem, the image-to-speech task was recently proposed, which generates spoken descriptions of images while bypassing text, via an intermediate phoneme representation (image-to-phoneme). Here, we present a comprehensive study of the image-to-speech task in which 1) several representative image-to-text generation methods are implemented for the image-to-phoneme task, 2) objective metrics are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model is proposed that synthesizes spoken descriptions of images while bypassing both text and phonemes. Extensive experiments are conducted on the public benchmark database Flickr8k. The results demonstrate that 1) state-of-the-art image-to-text models perform well on the image-to-phoneme task, 2) several evaluation metrics, including BLEU3, BLEU4, BLEU5, and ROUGE-L, can be used to evaluate image-to-phoneme performance, and 3) end-to-end image-to-speech synthesis that bypasses both text and phonemes is feasible.
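To illustrate how the phoneme-level metrics named above can be computed, the sketch below scores a hypothesised phoneme sequence against a reference using BLEU-n (via NLTK's sentence_bleu) and an LCS-based ROUGE-L F1. This is a minimal illustration, not the authors' evaluation code: the phoneme sequences are made-up ARPAbet-style examples, and the choice of BLEU smoothing is an assumption.

```python
# Illustrative sketch (not the paper's code): BLEU-n and ROUGE-L over
# phoneme token sequences, as used to evaluate image-to-phoneme output.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def phoneme_bleu(references, hypothesis, n):
    """BLEU-n over phoneme tokens with uniform n-gram weights.
    Smoothing (an assumption here) avoids zero scores on short sequences."""
    weights = tuple(1.0 / n for _ in range(n))
    return sentence_bleu(references, hypothesis, weights=weights,
                         smoothing_function=SmoothingFunction().method1)

def rouge_l_f1(reference, hypothesis):
    """ROUGE-L F1 based on the longest common subsequence of phonemes."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if reference[i] == hypothesis[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / n, lcs / m
    return 2 * precision * recall / (precision + recall)

# Hypothetical phoneme sequences for one image description.
refs = [["DH", "AH", "D", "AO", "G", "R", "AH", "N", "Z"]]
hyp = ["DH", "AH", "D", "AO", "G", "JH", "AH", "M", "P", "S"]
for n in (3, 4, 5):
    print(f"BLEU{n}: {phoneme_bleu(refs, hyp, n):.3f}")
print(f"ROUGE-L: {rouge_l_f1(refs[0], hyp):.3f}")
```

Because phoneme sequences are much longer than word sequences for the same caption, higher-order n-gram overlap (BLEU3-BLEU5) and LCS-based ROUGE-L are more discriminative here than unigram or bigram BLEU would be.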

Files

Image2Speech_Journal.pdf
(pdf | 1.09 MB)
License info not available