Real-time Lipreading

Effects of Compression and Frame-rate

Master Thesis (2019)

Author(s)

R. Maan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.J. Broekens – Mentor (TU Delft - Interactive Intelligence)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Deep learning, Lipreading, AVSR

To reference this document use:

https://resolver.tudelft.nl/uuid:a7b5241d-5002-446f-bd23-b69c78549ea8

Publication Year

2019

Language

English

Graduation Date

30-08-2019

Awarding Institution

Delft University of Technology

Programme

Computer Science

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Speech recognition systems are all around us. From personal assistants in mobile phones and smart speakers to robots, we use them every day. However, communicating with them can be troublesome in noisy environments because they rely only on audio signals. Visual speech recognition, or lipreading, can address this problem. Research on lipreading systems has been going on since the 1980s, but such systems are not yet used in real-time applications. This can be attributed to the fact that they must process significantly more data than audio speech recognition, which takes too long for real-time use. This thesis investigates whether frame rate, JPEG compression, or the presence of noise affects the performance of a lipreading system. The LipNet system and the Lip Reading in the Wild (LRW) dataset are used for the experiments. The frame rate of the dataset's videos is varied from 11 to 25 in steps of 2, with one value per experiment. The compression level is varied between no compression and 30% quality to find out how compression affects performance. In addition, salt-and-pepper noise is artificially added to the dataset. The results show that performance is unaffected down to a frame rate of 21, degrades gradually from frame rate 19 to 13, and then drops suddenly. Compressing frames to 30% of their original quality causes only a slight decrease in accuracy while greatly reducing data size, which makes it easier to transmit data for cloud processing. Noise with a probability of only 3% causes substantial degradation in performance.
This means that by decreasing the frame rate to 21 and compressing the frames to 30% quality, memory usage can be reduced to 25% without much impact on the performance of the system. However, the quality of the video cameras and of the data transmission to the cloud needs to be monitored to avoid noise.
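
The frame-rate sweep and the salt-and-pepper noise described above can be illustrated with a minimal sketch. It is not the thesis code: the function names are hypothetical, frames are modelled as plain lists of 8-bit grayscale values, frames are dropped by picking the nearest earlier source frame rather than by interpolation, and salt and pepper are assumed to be equally likely. The 29-frame, 25 fps clip in the usage lines is also an assumption, since the abstract does not state the clip length. JPEG compression is left out because it needs an image library.

```python
import random


def frame_rates_to_test(lowest=11, highest=25, step=2):
    """Frame rates in the sweep: 11 to 25 in steps of 2 (values from the abstract)."""
    return list(range(lowest, highest + 1, step))


def subsample_indices(num_frames, src_fps, dst_fps):
    """Indices of the source frames kept when a clip is reduced from src_fps to dst_fps.

    Assumption: each output frame takes the nearest earlier source frame;
    the thesis may resample differently.
    """
    if not 0 < dst_fps <= src_fps:
        raise ValueError("dst_fps must be positive and not exceed src_fps")
    num_out = round(num_frames * dst_fps / src_fps)
    return [min(num_frames - 1, int(i * src_fps / dst_fps)) for i in range(num_out)]


def add_salt_and_pepper(frame, prob, rng):
    """Return a copy of frame in which each pixel is corrupted with probability prob.

    A corrupted pixel becomes 255 (salt) or 0 (pepper) with equal chance,
    which is an assumption of this sketch.
    """
    noisy = []
    for row in frame:
        new_row = []
        for pixel in row:
            if rng.random() < prob:
                new_row.append(255 if rng.random() < 0.5 else 0)
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


if __name__ == "__main__":
    print(frame_rates_to_test())
    # Assumed clip: 29 frames recorded at 25 fps, reduced to 21 fps.
    print(subsample_indices(29, 25, 21))
    rng = random.Random(0)  # fixed seed so the run is repeatable
    gray_frame = [[128] * 8 for _ in range(4)]
    for row in add_salt_and_pepper(gray_frame, 0.03, rng):
        print(row)
```

Passing an explicit `random.Random` instance keeps the noise repeatable across runs, which matters when the same corrupted clips are needed for every experiment in a sweep.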

Files

Riya_Maan_thesis.pdf

(pdf | 8.2 MB)

License info not available