Limits on Modeling Compensation in Multimodal DNNs for Audio Visual Speech Recognition


Abstract

Speech is a natural way of communicating that does not require us to develop any new skills to interact with electronic devices. With the evolution of technology, speech has become one of the primary means of communicating with such devices. Speech recognition is a form of multimedia content analysis in which the information carried in a speech signal is transcribed into a character string. Information in the real world is perceived via several input channels, and each modality conveys additional information about a real-world concept. Likewise, the perception of speech in the human brain is bimodal: we combine information from the visual and audio modalities to disambiguate speech. The system studied here is a multimodal speech recognition system in which features are generated by correlating the visual and audio modalities using a multimodal Deep Belief Network. This thesis reproduces that system and explores several aspects of its performance under the real-life conditions in which speech must be recognized. Since the limitations of multimodal deep learning approaches are not well understood, we aim to gain insight into how closely such systems resemble humans in their ability to leverage multimodality. Our experiments demonstrate that the visual modality complements the audio modality, providing information such as the place of articulation. Further studies are performed to shed light on the limits of such a multimodal Deep Neural Network for audio-visual speech recognition. In real life, audio-visual speech recognition systems will encounter perturbations such as reverberation and visual occlusion. The behavior of the system is analyzed in a simulated environment replicating such real-life conditions. Finally, a study is performed on the effect of the visual modality on the recognition of phonemes, the basic building blocks of speech.
The study conducted in this thesis supports the conclusion that the multimodal Deep Neural Network is far from achieving human-like performance in the presence of perturbations, which demonstrates the need for further research on the robustness of multimodal Deep Neural Networks in real-life scenarios.
