From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning

Journal Article (2018)
Author(s)

Jingkuan Song (University of Electronic Science and Technology of China)

Yuyu Guo (University of Electronic Science and Technology of China)

Lianli Gao (University of Electronic Science and Technology of China)

Xuelong Li (Chinese Academy of Sciences)

A. Hanjalic (TU Delft - Intelligent Systems)

Heng Tao Shen (University of Electronic Science and Technology of China)

Department
Intelligent Systems
Copyright
© 2018 Jingkuan Song, Yuyu Guo, Lianli Gao, Xuelong Li, A. Hanjalic, Heng Tao Shen
DOI related publication
https://doi.org/10.1109/TNNLS.2018.2851077
Publication Year
2018
Language
English
Bibliographical Note
Accepted Author Manuscript
Pages (from-to)
1-12

Abstract

Video captioning is, in essence, a complex natural process affected by various uncertainties stemming from video content, subjective judgment, and other factors. In this paper, we build on recent progress in using the encoder-decoder framework for video captioning and address what we find to be a critical deficiency of existing methods: most decoders propagate deterministic hidden states, and such complex uncertainty cannot be modeled efficiently by deterministic models. We propose a generative approach, referred to as the multimodal stochastic recurrent neural network (MS-RNN), which models the uncertainty observed in the data using latent stochastic variables. MS-RNN can thereby improve the performance of video captioning and generate multiple sentences to describe a video under different random factors. Specifically, a multimodal long short-term memory (LSTM) is first proposed to interact with both visual and textual features and capture a high-level representation. Then, a backward stochastic LSTM is proposed to support uncertainty propagation by introducing latent variables. Experimental results on two challenging data sets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), show that the proposed MS-RNN approach outperforms state-of-the-art methods for video captioning.

Metadata only record. There are no files for this record.