Visually grounded fine-grained speech representation learning


Abstract

Visually grounded speech representation learning has been shown to be useful in the field of speech representation learning. Studies of visually grounded speech embeddings adopted the speech-image cross-modal retrieval task to evaluate their models, since this task allows both modalities to be learned jointly and their relationships to be discovered: the audio and visual modalities were jointly embedded into a common space in which the speech embeddings and the image embeddings were learned, and the obtained embeddings were evaluated by cross-modal retrieval to measure model performance. To date, studies on visually grounded speech representation learning have trained on scene-based datasets, such as Flickr8k, in which the model learns different objects in order to infer a new scene; work investigating a visually grounded speech representation model's ability to combine different attribute information to infer new objects is lacking. Therefore, this thesis presented a visually grounded speech representation model trained on fine-grained datasets that contain a high level of detail about objects, so that the model learns the attribute information associated with objects and uses it to infer new objects. The proposed model adopted a dual-encoder structure and used different DNN models to extract visual and audio features; an adapted batch loss was used to calculate the similarities between the two modalities. Experiments were conducted to test the model's performance: 1) parameter tuning to obtain a better-performing model; 2) comparison with state-of-the-art models in speech-image cross-modal retrieval and fine-grained text-image cross-modal retrieval; 3) ablation studies to evaluate the components of the model; and 4) an investigation of the attention module's effectiveness. The results indicated that the proposed model was able to learn the relationships between attributes and objects to retrieve new visual objects, and that it outperformed other visually grounded speech learning models.
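
The abstract only sketches the architecture, so the following is a minimal illustrative sketch (not the thesis implementation) of a dual-encoder with a batch-wise contrastive loss in PyTorch. The linear projections, feature dimensions, margin value, and the batch_hinge_loss function are hypothetical stand-ins; the thesis's adapted batch loss and its DNN feature extractors are not specified here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualEncoder(nn.Module):
        """Toy dual-encoder: separate projections map image and speech
        features into a shared embedding space (stand-ins for the DNN
        feature extractors described in the abstract)."""

        def __init__(self, image_dim=2048, audio_dim=1024, embed_dim=512):
            super().__init__()
            self.image_proj = nn.Linear(image_dim, embed_dim)
            self.audio_proj = nn.Linear(audio_dim, embed_dim)

        def forward(self, image_feats, audio_feats):
            img = F.normalize(self.image_proj(image_feats), dim=-1)
            aud = F.normalize(self.audio_proj(audio_feats), dim=-1)
            return img, aud

    def batch_hinge_loss(img, aud, margin=0.2):
        """Illustrative batch-wise triplet (hinge) loss over the pairwise
        similarity matrix: diagonal entries are matched speech-image pairs,
        all other entries in the batch act as negatives."""
        sims = img @ aud.t()                                # (B, B) cosine similarities
        pos = sims.diag().unsqueeze(1)                      # similarities of matched pairs
        cost_img = (margin + sims - pos).clamp(min=0)       # image-to-speech hinge terms
        cost_aud = (margin + sims - pos.t()).clamp(min=0)   # speech-to-image hinge terms
        mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
        cost_img = cost_img.masked_fill(mask, 0.0)          # ignore the positive pairs
        cost_aud = cost_aud.masked_fill(mask, 0.0)
        return cost_img.mean() + cost_aud.mean()

    # Example usage with random features in place of real image/speech inputs
    model = DualEncoder()
    img_emb, aud_emb = model(torch.randn(8, 2048), torch.randn(8, 1024))
    loss = batch_hinge_loss(img_emb, aud_emb)

In this kind of setup, speech-image retrieval is evaluated by ranking the similarity matrix row-wise (image query) or column-wise (speech query) and reporting recall at fixed ranks.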