Show and speak: Directly synthesize spoken description of images

Wang, X.; Feng, S.; Zhu, Jihua; Hasegawa-Johnson, Mark; Scharenborg, O.E.

doi:10.1109/ICASSP39728.2021.9414021

Title

Show and speak: Directly synthesize spoken description of images

Author

Wang, X. (TU Delft Multimedia Computing; Xi’an Jiaotong University)
Feng, S. (TU Delft Multimedia Computing)
Zhu, Jihua (Xi’an Jiaotong University)
Hasegawa-Johnson, Mark (University of Illinois at Urbana-Champaign)
Scharenborg, O.E. (TU Delft Multimedia Computing)

Date

2021

Abstract

This paper proposes a new model, referred to as the show and speak (SAS) model, that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
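
The data flow described in the abstract (image in, spectrogram frames out) can be illustrated with a toy, shape-level sketch. This is not the SAS architecture: the patch encoder, the attention step, the recurrent update, the sizes (`N_MELS`, `DIM`, `PATCH`) and all function names are assumptions made for illustration, and the weights are random and untrained. The final waveform step, which the abstract attributes to WaveNet, is left out.

```python
import numpy as np

N_MELS = 80   # assumed number of mel bins per spectrogram frame
DIM = 16      # assumed hidden size
PATCH = 8     # assumed patch size for the toy image encoder


def init_params(seed=0):
    """Random, untrained weights; a real model would learn these."""
    rng = np.random.default_rng(seed)
    scale = 0.1
    return {
        "enc": rng.standard_normal((PATCH * PATCH * 3, DIM)) * scale,
        "frame_in": rng.standard_normal((N_MELS, DIM)) * scale,
        "state": rng.standard_normal((DIM, DIM)) * scale,
        "ctx": rng.standard_normal((DIM, DIM)) * scale,
        "out": rng.standard_normal((DIM, N_MELS)) * scale,
    }


def encode(image, p):
    """Encoder: cut an H x W x 3 image into patches, project each to DIM."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    x = image[: gh * PATCH, : gw * PATCH]
    x = x.reshape(gh, PATCH, gw, PATCH, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(gh * gw, PATCH * PATCH * c)
    return np.tanh(x @ p["enc"])  # (num_patches, DIM)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def decode(features, p, n_frames):
    """Decoder: emit one spectrogram frame per step, feeding each back in."""
    state = np.zeros(DIM)
    frame = np.zeros(N_MELS)  # all-zero start frame
    frames = []
    for _ in range(n_frames):
        attn = softmax(features @ state)  # weights over image patches
        context = attn @ features         # attended image summary, (DIM,)
        state = np.tanh(
            frame @ p["frame_in"] + state @ p["state"] + context @ p["ctx"]
        )
        frame = state @ p["out"]          # predicted frame, (N_MELS,)
        frames.append(frame)
    return np.stack(frames)               # (n_frames, N_MELS)


def image_to_spectrogram(image, p, n_frames=50):
    return decode(encode(image, p), p, n_frames)


params = init_params()
image = np.random.default_rng(1).random((64, 64, 3))  # stand-in image
spec = image_to_spectrogram(image, params, n_frames=50)
print(spec.shape)  # (50, 80)
```

In this sketch no text or phoneme representation appears between the image and the spectrogram, which is the property the abstract emphasizes. A separate vocoder would be needed to turn `spec` into audio.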

Subject

Encoder-decoder
Image captioning
Image-to-speech
Sequence-to-sequence
Speech synthesis

To reference this document use:

http://resolver.tudelft.nl/uuid:5ce6b416-ef81-41b6-adf9-8456cf455992

DOI

https://doi.org/10.1109/ICASSP39728.2021.9414021

Publisher

IEEE, Piscataway

ISBN

978-1-7281-7606-2

Source

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Event

ICASSP 2021, 2021-06-06 → 2021-06-11, Virtual Conference/Toronto, Canada

Series

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1520-6149

Bibliographical note

Accepted author manuscript

Part of collection

Institutional Repository

Document type

conference paper

Files

PDF

ICASSP2021_Image2Speech.pdf

758.87 KB