Show and speak

Directly synthesize spoken description of images

Conference Paper (2021)

Author(s)

Xinsheng Wang (TU Delft - Multimedia Computing, Xi’an Jiaotong University)

Siyuan Feng (TU Delft - Multimedia Computing)

Jihua Zhu (Xi’an Jiaotong University)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

Odette Scharenborg (TU Delft - Multimedia Computing)

Research Group

Multimedia Computing

DOI related publication

https://doi.org/10.1109/ICASSP39728.2021.9414021

Keywords

Encoder-decoder, Image captioning, Image-to-speech, Sequence-to-sequence, Speech synthesis

To reference this document use:

https://resolver.tudelft.nl/uuid:5ce6b416-ef81-41b6-adf9-8456cf455992

More Info

Publication Year

2021

Language

English

Pages (from-to)

4190-4194

ISBN (print)

978-1-7281-7606-2

ISBN (electronic)

978-1-7281-7605-5

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
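The data flow described in the abstract (image in, spectrogram predicted by an encoder-decoder, waveform produced by a vocoder, with no text or phoneme stage in between) can be illustrated with a toy sketch. This is not the authors' implementation: every function name, the constants N_MELS and FEATURE_DIM, and the arithmetic inside each stage are hypothetical placeholders chosen only to make the pipeline shape concrete. In particular, vocode() is a trivial stand-in for WaveNet, and the encoder and decoder are fixed formulas rather than learned networks.

```python
import math
from typing import List

Vector = List[float]

N_MELS = 8        # hypothetical number of mel bins per spectrogram frame
FEATURE_DIM = 4   # hypothetical size of the image feature vector


def encode_image(image: List[List[float]]) -> Vector:
    """Toy encoder: average the flattened pixels in FEATURE_DIM bands.

    Stand-in for a learned image encoder (assumption, not from the paper).
    """
    flat = [p for row in image for p in row]
    band = max(1, len(flat) // FEATURE_DIM)
    return [sum(flat[i * band:(i + 1) * band]) / band for i in range(FEATURE_DIM)]


def decode_spectrogram(features: Vector, n_frames: int) -> List[Vector]:
    """Toy autoregressive decoder: each frame depends on the image
    features and on the previously generated frame."""
    context = sum(features) / len(features)
    prev = [0.0] * N_MELS
    frames: List[Vector] = []
    for t in range(n_frames):
        frame = [
            math.tanh(0.5 * prev[m] + context + 0.1 * math.sin(t + m))
            for m in range(N_MELS)
        ]
        frames.append(frame)
        prev = frame
    return frames


def vocode(spectrogram: List[Vector], hop: int = 4) -> Vector:
    """Placeholder vocoder: emits `hop` samples per frame at the frame's
    mean level. The paper uses WaveNet for this step."""
    wave: Vector = []
    for frame in spectrogram:
        level = sum(frame) / len(frame)
        wave.extend([level] * hop)
    return wave


def show_and_speak(image: List[List[float]], n_frames: int = 5) -> Vector:
    """Image -> features -> spectrogram -> waveform, with no text stage."""
    features = encode_image(image)
    spectrogram = decode_spectrogram(features, n_frames)
    return vocode(spectrogram)


if __name__ == "__main__":
    toy_image = [[0.1, 0.2], [0.3, 0.4]]
    waveform = show_and_speak(toy_image)
    print(len(waveform), "samples generated")
```

The point of the sketch is the interface between stages: the decoder's output is a sequence of spectrogram frames rather than words or phonemes, and the vocoder consumes those frames directly.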

Files

ICASSP2021_Image2Speech.pdf

(pdf | 0.741 MB)

License info not available