Building a visual speech recognizer

Title: Building a visual speech recognizer
Author: Driel, K.F.
Contributor: Rothkrantz, L.J.M. (mentor)
Faculty: Electrical Engineering, Mathematics and Computer Science
Department: Mediamatics
Date: 2009-08-24

Abstract: This thesis describes how an automatic lip reader was realized. Visual speech recognition is a precondition for more robust speech recognition in general. The development of the software comprised the following steps: gathering training data, extracting meaningful features from the obtained video material, training the speech recognizer, and finally evaluating the resulting product.

First, research was done to gain insight into the theoretical aspects of automatic lip reading, the state of the art, speech corpus development, face tracking, and feature extraction. Gathering training data amounted to recording and composing a new audio-visual speech corpus for Dutch. With frontal and side images of 70 different speakers, recorded at a frame rate of 100 frames per second, this is the most diverse such corpus currently in existence. Analysis of the new corpus shows an increase in quality compared to other corpora.

Visual information is obtained by processing the video footage: using Active Appearance Models, the points of an a priori defined model of the lower half of the face are tracked over time. From the model point coordinates, inter-point distances, and enclosed areas, features are computed that serve as input to the speech recognizer. Training was accomplished by presenting labeled training data to viseme-based Hidden Markov Models that model speech production. In a few steps the model parameters were adjusted, after which the models could be used to recognize visual speech signals. The recognizer was implemented using tools from the Hidden Markov Model Toolkit (HTK).
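The abstract mentions that features are derived from the tracked model-point coordinates, distances, and areas. A minimal sketch of two such geometric features follows; the landmark layout (corner points at fixed indices) is a hypothetical stand-in, not the thesis' actual feature set.

```python
import math

def mouth_features(points):
    """Compute simple geometric features from 2D lip landmarks.

    `points` is a list of (x, y) tuples tracing the outer lip
    contour in order. The assumption that the mouth corners sit
    at index 0 and the list midpoint is illustrative only.
    """
    n = len(points)
    # Mouth width: distance between the two assumed corner points.
    left, right = points[0], points[n // 2]
    width = math.dist(left, right)
    # Mouth area via the shoelace formula over the closed contour.
    area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    area = abs(area) / 2.0
    return {"width": width, "area": area}
```

Computed per frame, such distance and area values form the time series that the recognizer consumes.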
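The recognition step itself was implemented with HTK's viseme-based Hidden Markov Models. The core decoding idea, finding the most likely hidden state sequence for an observed feature sequence, can be sketched with a toy Viterbi decoder; the states, observations, and probabilities below are invented for illustration and bear no relation to the thesis models.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a discrete-emission HMM (toy example)."""
    # V[t][s] = probability of the best path ending in state s at time t.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor for state s at time t.
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

In the actual system, HTK performs this decoding over continuous-density viseme HMMs rather than the discrete toy emissions shown here.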
The results of a visual speech recognizer trained on data from a single person depend on the utterance type of the unlabeled data. For the simple word-level task of digit recognition, 78% of the digits were recognized correctly, with a word recognition rate of 68%. Letter recognition tasks performed considerably worse, but given the limitations that the use of visemes instead of phonemes imposes, these results are at the expected level. The data corpus and visual speech recognizer will be a valuable asset to future research.

Subject: automatic lip reading; visual speech recognition
To reference this document use: http://resolver.tudelft.nl/uuid:dccc1188-0e63-44c5-ba98-5476d65f20c6
Part of collection: Student theses
Document type: master thesis
Rights: (c) 2009 Driel, K.F.
Files: PDF, Building_a_visual_speech_ ... _paper.pdf (3.75 MB)