Align or attend?

Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Conference Paper (2021)
Author(s)

Liming Wang (University of Illinois at Urbana-Champaign)

X. Wang (Xi’an Jiaotong University, TU Delft - Multimedia Computing)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

Odette Scharenborg (TU Delft - Multimedia Computing)

Najim Dehak (Johns Hopkins University)

Multimedia Computing
DOI related publication
https://doi.org/10.1109/ICASSP39728.2021.9414418
Publication Year
2021
Language
English
Pages (from-to)
7603-7607
ISBN (print)
978-1-7281-7606-2
ISBN (electronic)
978-1-7281-7605-5

Abstract

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some form of alignment or attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory with retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.

Metadata only record. There are no files for this record.