Towards Robust Visual Speech Recognition

Automatic Systems for Lip Reading of Dutch

Abstract

In the last two decades we have witnessed a rapid increase in computational power, governed by Moore's Law. As a side effect, faster CPUs have also become more affordable. As a result, many new “smart” devices flooded the market and made information systems widespread. The number of users of information systems has also increased manyfold, and the user population has changed from a small number of initiates to a majority of non-technical people. To make this transition possible, system developers had to change computer user interfaces to make them simpler and more intuitive. However, the interaction was still based on rather artificial devices such as the mouse and keyboard. Since Moore's Law continues to hold, we have reached a critical moment at which current systems can easily cope with other input streams, of which video and audio are the most important. We can, therefore, envision systems with which we communicate through speech and body movements and that automatically and transparently adapt to the environment and the user. This can be done, for instance, by recognizing the user's affective state, by understanding the user's task and by recognizing the context of the interaction. Automatic speech recognition by capturing and processing the audio signal is one development in this direction. Even though automatic speech recognition has achieved spectacular results in controlled settings, its performance still depends on the context (for instance, on the level of background noise). Automatic lip reading emerged in this context as a way to enhance automatic speech recognition in noisy environments. Even though it is still a relatively novel research domain, other applications have been found that employ lip reading on its own: interfaces for hearing-impaired persons, security applications, speech recovery from mute or deteriorated films, and silent interfaces.
With the advances in visual signal processing, research in lip reading was also revitalized. However, at the time of writing this thesis, lip reading was still waiting for its great leap. This thesis investigates several techniques for steering lip reading towards more robust performance. The thesis starts by introducing the relevant methodologies that govern automatic lip reading. Next, it introduces all the concepts needed to understand the technologies, experiments, results and discussions presented later on. It is, therefore, one of the most important parts of the thesis. The presentation of the state of the art in lip reading sets the starting point of the research presented. Before continuing with the lip reading process, the thesis introduces the mathematical foundations that give theoretical support to the analysis. All our systems are based on the Hidden Markov Model approach. This paradigm has proved very useful in similar problems, and we successfully employed it for lip reading. The main idea behind it is Bayes' rule, which says that, starting from some a priori knowledge, we can always improve our understanding of a system through observation. Observation translates into processing the video stream into a set of features that describe what is being said by the speaker. However, in order to train lip reading systems appropriately, a large amount of data is needed. The first important contribution of our research is a large data corpus for the Dutch language. This corpus, named “New Delft University of Technology Audio Visual Speech Corpus”, is, at the date of writing this thesis, one of the largest corpora for lip reading in Dutch. The corpus contains dual-view, high-speed (i.e. 100 Hz) recordings of continuous speech in Dutch. While building the corpus, we also produced an initial set of guidelines for building a data corpus for lip reading, which we hope will be useful to other researchers.
However, the core of this thesis is data parametrization. Data parametrization is the process that transforms the input video data into a set of features that are later used for training and testing the resulting recognizer. The parametrization should reduce the size of the input data while preserving the most important information related to what the speaker says. We investigated three data parametrization techniques, each coming from a different category of algorithms. We used Active Appearance Models, which generate a combined geometric and appearance-based set of features; optical flow analysis, an appearance-based approach that directly accounts for the apparent movement on the speaker's face; and a statistical approach that generates the geometry of the lips without starting from an a priori fixed model. During the research presented in this thesis we investigated the performance of these data parametrization techniques and pointed out their strengths and weaknesses. We also analysed the performance of lip reading from other points of view: the influence of the sampling rate of the video data, the performance of the lip readers as a function of the recognition task, and the performance as a function of the size of the corpus used. Answering all these questions improved our understanding of the limitations of lip reading and of its possible improvements.