Searched for: +
(1 - 1 of 1)
document
Wang, X. (author), Feng, S. (author), Zhu, Jihua (author), Hasegawa-Johnson, Mark (author), Scharenborg, O.E. (author)
This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that...
conference paper 2021