Efficient Video Action Recognition

How well does TriDet perform and generalize in a limited compute power and data setting?

Bachelor Thesis (2023)
Author(s)

A. Dămăcuș (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.C. van Gemert – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Ombretta Strafforello – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

R. Bruintjes – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

A. Lengyel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Petr Kellnhofer – Graduation committee member (TU Delft - Computer Graphics and Visualisation)

Publication Year
2023
Language
English
Copyright
© 2023 Alex Dămăcuș
Graduation Date
29-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In temporal action localization, the goal is to predict which action is present in an input video, along with its temporal boundaries. Several powerful models have been proposed over the years, with transformer-based models achieving state-of-the-art performance in recent months. Although novel models are becoming increasingly accurate, authors rarely study how limited training data or computation power affects their performance. This study evaluates TriDet, a transformer-based temporal action localization model that achieves state-of-the-art performance on two different benchmarks, in a setting with limited training data and computation power. It is found that TriDet achieves close to state-of-the-art performance when only 60% of the training data, or approximately 90 action instances per class, is used. It is also notable that inference time, memory usage, multiply-accumulate operations and GPU utilization scale linearly with the length of the input tensor passed to the model. These findings, combined with TriDet’s mean training time of 11 minutes on the THUMOS’14 dataset, can be used to estimate the model’s behavior when run in environments with less computation power.
