Word recognition in a model of visually grounded speech: An analysis using techniques inspired by human speech processing research

Scholten, J.S.M.

Title

Word recognition in a model of visually grounded speech: An analysis using techniques inspired by human speech processing research

Author

Scholten, J.S.M. (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

Scharenborg, O.E. (mentor)
Merkx, Danny (mentor)
Tintarev, N. (graduation committee)
Oertel Genannt Bierbach, C.R.M.M. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science

Date

2020-07-24

Abstract

A Visually Grounded Speech model is a neural model that is trained to embed image-caption pairs close together in a common embedding space. As a result, such a model can retrieve semantically related images given a spoken caption, and vice versa. The purpose of this research is to investigate whether and how a Visually Grounded Speech model can recognise individual words. Literature on word recognition in humans, Automatic Speech Recognition and Visually Grounded Speech models was reviewed. Techniques used to analyse human speech processing, such as gating and priming, served as inspiration for the design of the experiments in this thesis. Multiple aspects of word recognition were investigated in three experiments. The first investigated whether the model can recognise individual words. The second investigated whether the model can recognise words from a partial sequence of their phonemes. The third investigated how word recognition is affected by contextual information. The experiments show that the model is able to recognise words without being supervised for that task, and that factors such as word frequency, word length and speaking rate affect word recognition. The experiments also reveal that words can be recognised from a partial input of a word’s phoneme sequence, and that recognition is negatively influenced by competition from the word-initial cohort. Finally, the word-recognition-in-context experiment reveals that contextual information can enhance the recognition of words that are otherwise recognised less well.

Subject

Visually Grounded Speech
Recurrent Neural Network
Flickr8k
Automatic Speech Recognition
Word Recognition

To reference this document use:

http://resolver.tudelft.nl/uuid:aa2e6e66-5dac-4a8b-a247-63b94974b211

Part of collection

Student theses

Document type

master thesis

Files

PDF

Master_Thesis_Sebastiaan_ ... 690591.pdf

9.17 MB