Word recognition in a model of visually grounded speech

An analysis using techniques inspired by human speech processing research


Abstract

A Visually Grounded Speech model is a neural model trained to embed image-caption pairs closely together in a common embedding space. As a result, such a model can retrieve semantically related images given a spoken caption, and vice versa. The purpose of this research is to investigate whether and how a Visually Grounded Speech model can recognise individual words. Literature on word recognition in humans, Automatic Speech Recognition and Visually Grounded Speech models was reviewed, and techniques used to analyse human speech processing, such as gating and priming, inspired the design of the experiments in this thesis.

Multiple aspects of word recognition were investigated through three experiments: firstly, whether the model can recognise individual words; secondly, whether it can recognise words from a partial sequence of their phonemes; and thirdly, how word recognition is affected by contextual information. The experiments show that the model is able to recognise words without being supervised for that task, and that factors such as word frequency, word length and speaking rate affect word recognition. They further reveal that words can be recognised from a partial input of a word's phoneme sequence, and that recognition is negatively influenced by competition from the word-initial cohort. Finally, the word-recognition-in-context experiment shows that contextual information can enhance the recognition of words that are otherwise recognised less well.
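
To make the retrieval mechanism concrete, the sketch below shows one common way such cross-modal search is implemented: both modalities are projected into a shared space and images are ranked by cosine similarity to the caption embedding. The random vectors and the dimensionality (512) are stand-ins for the outputs of the thesis's speech and image encoders, which are not specified in this abstract; this is a minimal illustration, not the model's actual code.

    import numpy as np

    rng = np.random.default_rng(0)
    caption_emb = rng.standard_normal(512)           # stand-in speech-encoder output
    image_embs = rng.standard_normal((1000, 512))    # stand-in image-encoder outputs

    # Normalise so that a plain dot product equals cosine similarity.
    caption_emb /= np.linalg.norm(caption_emb)
    image_embs /= np.linalg.norm(image_embs, axis=1, keepdims=True)

    scores = image_embs @ caption_emb                # one similarity score per image
    top5 = np.argsort(scores)[::-1][:5]              # indices of the best-matching images
    print("top-5 image indices:", top5)

The same ranking works in the other direction (speech retrieval from an image) by swapping the roles of the caption and image embeddings.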
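A second sketch illustrates the gating-style probe of recognition from partial phoneme input. The bag-of-phonemes encoder and the three-word vocabulary below are deliberately crude assumptions, not the thesis's model: they only serve to show how members of a word-initial cohort (here "cat", "cap" and "captain") compete until later phonemes disambiguate the input.

    import numpy as np

    VOCAB = {"cat": "k ae t".split(),
             "cap": "k ae p".split(),
             "captain": "k ae p t ih n".split()}
    PHONES = sorted({p for seq in VOCAB.values() for p in seq})

    def embed(phonemes):
        """Crude stand-in encoder: phoneme counts, L2-normalised."""
        v = np.array([phonemes.count(p) for p in PHONES], dtype=float)
        n = np.linalg.norm(v)
        return v / n if n else v

    VOCAB_EMB = {w: embed(seq) for w, seq in VOCAB.items()}

    def best_match(prefix):
        """Vocabulary word whose embedding is most similar to the partial input."""
        return max(VOCAB_EMB, key=lambda w: float(embed(prefix) @ VOCAB_EMB[w]))

    # Present "captain" one phoneme at a time, as in a gating experiment.
    for gate in range(1, len(VOCAB["captain"]) + 1):
        prefix = VOCAB["captain"][:gate]
        print(gate, prefix, "->", best_match(prefix))

In this toy setting "cat" and "cap" remain competitors through the early gates, and "captain" only wins once its later phonemes arrive, mirroring the cohort-competition effect described in the abstract.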