Group Equivariant Video Action Recognition

Making action-recognition networks equivariant to temporal direction and discrete spatial rotations

Abstract

This work applies the theory of group equivariance to video action recognition, replacing standard 3D convolutions with group convolutions that are equivariant to temporal direction and to spatial rotations by multiples of 90 degrees. We propose a temporal direction symmetry group T2 and extend the standard planar rotation group to three dimensions to form a 3D group whose convolutions are equivariant to discrete 90-degree spatial rotations. We analyse the efficacy of these 3D-G-CNNs as drop-in replacements in 3D networks by evaluating them on synthesized datasets of handwritten MNIST digits moving over a black background, as well as on the popular action recognition datasets UCF-101 and HMDB-51, and comparing the results against those of standard 3D CNNs on the same datasets.
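To make the idea of equivariance concrete, the following is a minimal NumPy sketch of a single-channel "lifting" layer. It is an illustration under stated assumptions, not the implementation used in this work. It assumes the symmetry group is the product of temporal reversal and the four planar 90-degree rotations (8 elements in total), and that a video is a `(T, H, W)` array. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def group_elements():
    """Assumed group: (reverse time?, number of 90-degree rotations), 8 elements."""
    return [(flip, k) for flip in (False, True) for k in range(4)]


def act(x, flip, k):
    """Apply a group element to a (T, H, W) array."""
    x = np.rot90(x, k, axes=(1, 2))      # rotate every frame by k * 90 degrees
    return x[::-1] if flip else x        # optionally reverse the frame order


def correlate_valid(x, w):
    """Plain 'valid' 3D cross-correlation of a video x with a filter w."""
    x = np.ascontiguousarray(x)
    w = np.ascontiguousarray(w)
    windows = sliding_window_view(x, w.shape)   # shape (T', H', W', t, h, w)
    return np.einsum("abcijk,ijk->abc", windows, w)


def lifting_conv(x, w):
    """Correlate x with every transformed copy of w.

    Returns one output channel per group element. The filter is assumed to be
    square in its spatial dimensions so all channels share one shape.
    """
    return np.stack([correlate_valid(x, act(w, f, k)) for f, k in group_elements()])


def channel_maxima(x, w):
    """Global max over space and time for each group channel."""
    return lifting_conv(x, w).reshape(len(group_elements()), -1).max(axis=1)


def invariant_descriptor(x, w):
    """Pool over the group as well, giving a transformation-invariant scalar."""
    return channel_maxima(x, w).max()


rng = np.random.default_rng(0)
video = rng.standard_normal((6, 8, 8))     # toy clip: 6 frames of 8x8 pixels
kernel = rng.standard_normal((3, 3, 3))

# Rotating the clip by 90 degrees and playing it backwards only permutes
# the group channels, so the sorted per-channel maxima are unchanged.
transformed = act(video, True, 1)
same = np.allclose(np.sort(channel_maxima(video, kernel)),
                   np.sort(channel_maxima(transformed, kernel)))
print("channel maxima agree up to permutation:", same)
```

In this sketch, transforming the input only reorders and transforms the output channels, which is the equivariance property. Pooling over the group channels then gives an invariant feature. A full G-CNN would stack further group-convolution layers on top of the lifting layer, which this sketch does not attempt.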