Efficient Video Action Recognition
How well does TriDet perform and generalize in a limited compute power and data setting?
A. Dămăcuș (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.C. van Gemert – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Ombretta Strafforello – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
R. Bruintjes – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
A. Lengyel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Petr Kellnhofer – Graduation committee member (TU Delft - Computer Graphics and Visualisation)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
In temporal action localization, the goal is to predict, given an input video, which actions are present and their temporal boundaries. Several powerful models have been proposed over the years, with transformer-based models achieving state-of-the-art performance in recent months. Although novel models are becoming increasingly accurate, authors rarely study how limited training data or limited computation power affects their model's performance. This study evaluates TriDet, a transformer-based temporal action localization model that achieves state-of-the-art performance on two benchmarks, in a setting with limited training data and computation power. It is found that TriDet achieves close to state-of-the-art performance when only 60% of the training data, or approximately 90 action instances per class, is used. Notably, inference time, memory usage, multiply-accumulate operations, and GPU utilization all scale linearly with the length of the input tensor. These findings, combined with TriDet's mean training time of 11 minutes on the THUMOS'14 dataset, can be used to estimate the model's behavior in lower computation power environments.
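To illustrate why cost can scale linearly with input length, the sketch below counts multiply-accumulate operations (MACs) for a 1-D convolution over a feature sequence. This is a simplified stand-in, not TriDet's actual architecture; the channel and kernel sizes are hypothetical. Doubling the sequence length roughly doubles the MAC count, mirroring the linear scaling reported above.

```python
def conv1d_macs(seq_len: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """MACs for a 1-D 'valid' convolution over a sequence.

    Each of the (seq_len - kernel + 1) output positions computes
    out_ch dot products of length in_ch * kernel.
    """
    out_len = seq_len - kernel + 1
    return out_len * out_ch * in_ch * kernel

# Hypothetical layer: 512 -> 512 channels, kernel size 3.
macs_1024 = conv1d_macs(1024, 512, 512, 3)
macs_2048 = conv1d_macs(2048, 512, 512, 3)

# Doubling the sequence length scales MACs (almost exactly) linearly.
print(macs_2048 / macs_1024)
```

The ratio is slightly above 2 only because the "valid" convolution trims `kernel - 1` positions at the boundaries, a constant offset that becomes negligible for long sequences.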