Real-time Lipreading

Effects of Compression and Frame-rate


Abstract

Speech recognition systems are all around us. From personal assistants in mobile phones and smart speakers to robots, we use speech recognition systems every day. However, communicating with them can be troublesome in noisy environments because they rely only on audio signals. This problem can be addressed with visual speech recognition, or lipreading, systems. Research on lipreading has been ongoing since the 1980s, but such systems are not yet used in real-time applications. This can be attributed to the fact that they need to process significantly more data than audio speech processing, which takes too much time for real-time use. This thesis investigates whether frame rate, JPEG compression, or the presence of noise affects the performance of a lipreading system. The LipNet system and the Lip Reading in the Wild (LRW) dataset are used for the experiments. The frame rate of the videos in the dataset is varied from 11 to 25 in increments of 2, one value per experiment. The compression level is varied between no compression and 30% quality to find out how compression affects lipreading performance. In addition, salt-and-pepper noise is artificially added to the dataset. The experiments show that performance is unaffected down to a frame rate of 21, degrades gradually from 19 to 13, and drops suddenly below that. Compressing the frames to 30% of their original quality causes only a slight decrease in accuracy while greatly reducing the data size, which makes it easier to transmit data for cloud processing. Noise, in contrast, degrades performance substantially even at a probability of only 3%.
This means that if we decrease the frame rate to 21 and compress the frames to 30% quality, memory usage can be reduced to 25% without much impact on the performance of the system. However, the quality of the video-capturing cameras and of the data transmission to the cloud needs to be monitored to avoid noise.
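The two perturbations that need no image codec, and the data-size estimate, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, frame-rate reduction is assumed to be nearest-frame subsampling, the noise probability is assumed to apply per pixel, and the size estimate assumes that frames compressed at 30% quality occupy roughly 30% of their original size, which the abstract does not state. JPEG compression itself is left out because it requires an image library.

```python
import random


def resample_indices(n_frames, src_fps, dst_fps):
    """Indices of the source frames kept when reducing src_fps to dst_fps.

    Assumption: nearest-frame subsampling without interpolation.
    """
    if not 0 < dst_fps <= src_fps:
        raise ValueError("dst_fps must be in (0, src_fps]")
    n_out = max(1, round(n_frames * dst_fps / src_fps))
    step = src_fps / dst_fps
    return [min(n_frames - 1, int(i * step)) for i in range(n_out)]


def salt_and_pepper(frame, prob, rng):
    """Return a noisy copy of a grayscale frame (list of rows of 0-255 ints).

    Assumption: each pixel is independently replaced with probability `prob`
    by 0 (pepper) or 255 (salt), both equally likely.
    """
    noisy = []
    for row in frame:
        new_row = []
        for pixel in row:
            if rng.random() < prob:
                new_row.append(rng.choice((0, 255)))
            else:
                new_row.append(pixel)
        noisy.append(new_row)
    return noisy


def relative_data_size(src_fps, dst_fps, compressed_fraction):
    """Fraction of the original data volume left after dropping frames
    and shrinking each remaining frame to `compressed_fraction` of its size."""
    return (dst_fps / src_fps) * compressed_fraction


if __name__ == "__main__":
    rng = random.Random(0)
    kept = resample_indices(n_frames=25, src_fps=25, dst_fps=21)
    frame = [[128] * 8 for _ in range(8)]          # toy 8x8 gray frame
    noisy = salt_and_pepper(frame, prob=0.03, rng=rng)
    print(len(kept), "of 25 frames kept")
    print("relative size:", relative_data_size(25, 21, 0.30))
```

Under the stated size assumption, 21/25 × 0.30 ≈ 0.25, which matches the reduction to 25% reported above; with a different relationship between JPEG quality and file size, the estimate would change accordingly.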