Towards Natural Language Understanding using Multimodal Deep Learning

Abstract

This thesis describes how multimodal sensor data from a 3D sensor and a microphone array can be processed with deep neural networks such that their fusion, realized as a trained neural network, a) is more robust to noise, b) outperforms unimodal recognition and c) enhances unimodal recognition in the absence of multimodal data. We built a framework covering the complete workflow for experimenting with multimodal sensor data: recording (with a Kinect 3D sensor), labelling, 3D signal processing, analysis and replay. We also built three custom recognizers (an automatic speech recognizer, a 3D object recognizer and a 3D gesture recognizer) to convert the raw sensor streams into decisions and feed these to the neural network using a late fusion strategy (sketched below). We recorded 25 participants performing 27 unique verbal and gestural interactions (intents) with objects and trained the neural network using a supervised strategy. We validated the framework by building a deep-neural-network-assisted speech recognizer that performs approximately 5% better with multimodal data at 20 dB SNR and up to 61% better with multimodal data at -5 dB SNR, while performing identically to the individual recognizer when fed a unimodal data stream. Analysis shows that the performance gain at low acoustic noise is due to true fusion of classifier results, whereas the gain at high acoustic noise is due to the absence of speech results (the speech recognizer can no longer detect speech events) while the gesture recognizer is unaffected. This work is of interest to computational linguists and computer vision researchers because it describes how practical issues with real, real-time data can be solved, such as dealing with sensor noise, offloading computation to the GPU, and 3D object and hand tracking. The speech, object and gesture recognizers are not state of the art, and with a small vocabulary of 27 unique phrases and 9 objects the study can be considered a preliminary experiment. The main contributions of this thesis project are a) a validated multimodal fusion framework and workflow for embodied natural language understanding named MASU, b) a 600 GB, 2.5-hour labelled multimodal database with synchronous multi-channel audio and 3D video, c) an algorithm for 3D hand-object detection and tracking, d) a recipe for training a deep neural network model for multimodal fusion and e) a demonstration of MASU in a practical real-time scenario.
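
The late fusion strategy mentioned above can be illustrated with a minimal sketch, not the thesis's MASU implementation: each unimodal recognizer emits a class-probability (decision) vector, the vectors are concatenated, and a small feed-forward network maps them to one of the 27 intents. The layer sizes and the dimensions of the speech and gesture decision vectors are illustrative assumptions; only the 27 intents and 9 objects come from the abstract.

# Illustrative late-fusion sketch (PyTorch); sizes are assumptions, not MASU's.
import torch
import torch.nn as nn

N_INTENTS = 27      # verbal/gestural interactions (intents) in the experiment
SPEECH_DIM = 27     # assumed size of the speech recognizer's decision vector
OBJECT_DIM = 9      # 9 objects
GESTURE_DIM = 27    # assumed size of the gesture recognizer's decision vector

class LateFusionNet(nn.Module):
    """Fuses per-modality decision vectors with a small MLP."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(SPEECH_DIM + OBJECT_DIM + GESTURE_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, N_INTENTS),
        )

    def forward(self, speech, obj, gesture):
        # A missing modality (e.g. no detected speech event at -5 dB SNR)
        # can be passed as a zero vector, so the fusion degrades gracefully
        # to the remaining modalities.
        return self.mlp(torch.cat([speech, obj, gesture], dim=-1))

if __name__ == "__main__":
    model = LateFusionNet()
    speech = torch.zeros(1, SPEECH_DIM)                 # speech not detected
    obj = torch.rand(1, OBJECT_DIM).softmax(dim=-1)     # object decision vector
    gesture = torch.rand(1, GESTURE_DIM).softmax(dim=-1)
    logits = model(speech, obj, gesture)
    print(logits.argmax(dim=-1))                        # predicted intent index

In the thesis, such a fusion network is trained with a supervised strategy on the labelled recordings; the sketch only shows the shape of the late-fusion idea, not the training recipe.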