Align or attend?

Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Conference Paper (2021)

Author(s)

Liming Wang (University of Illinois at Urbana Champaign)

X. Wang (Xi’an Jiaotong University, TU Delft - Multimedia Computing)

Mark Hasegawa-Johnson (University of Illinois at Urbana Champaign)

Odette Scharenborg (TU Delft - Multimedia Computing)

Najim Dehak (Johns Hopkins University)

Multimedia Computing

DOI related publication

https://doi.org/10.1109/ICASSP39728.2021.9414418

Language acquisition; Low-resource speech technology; Multimodal learning; Spoken term discovery

To reference this document use:

https://resolver.tudelft.nl/uuid:043e055f-194b-4f66-a399-8e76e141429a

Publication Year

2021

Language

English

Pages (from-to)

7603-7607

ISBN (print)

978-1-7281-7606-2

ISBN (electronic)

978-1-7281-7605-5

Abstract

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that an alignment or attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural machine translation (MT) with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.
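
As an illustration of the kind of metric the abstract refers to, the following is a minimal sketch of an alignment F1 computation. It assumes that an alignment is represented as a set of (speech segment index, image concept index) links and that F1 is the harmonic mean of link precision and recall; this record does not give the paper's exact definition, so the function name `alignment_f1`, the link representation, and the example data are all hypothetical.

```python
def alignment_f1(predicted, reference):
    """Return (precision, recall, F1) between two sets of alignment links.

    Assumption: each link is a hashable pair such as
    (speech_segment_index, image_concept_index). This is an illustrative
    definition, not necessarily the one used in the paper.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    # With no overlap (or an empty set) all three scores are zero.
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: four speech segments aligned to three image concepts.
reference_links = {(0, 0), (1, 1), (2, 1), (3, 2)}
predicted_links = {(0, 0), (1, 1), (2, 2), (3, 2)}

precision, recall, f1 = alignment_f1(predicted_links, reference_links)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

In this toy example three of the four predicted links match the reference, so precision, recall, and F1 all come out at 0.75.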

Metadata only record. There are no files for this record.