Efficient Video Action Recognition
How well does TriDet perform and generalize in a limited compute power and data setting?
A. Dămăcuș (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.C. van Gemert – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Ombretta Strafforello – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
R. Bruintjes – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
A. Lengyel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Petr Kellnhofer – Graduation committee member (TU Delft - Computer Graphics and Visualisation)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
In temporal action localization, the goal is to predict, given an input video, which actions are present and their temporal boundaries. Several powerful models have been proposed over the years, with transformer-based models achieving state-of-the-art performance in recent months. Although novel models are becoming increasingly accurate, authors rarely study how limited training data or limited computation power affects their model's performance. This study evaluates TriDet, a transformer-based temporal action localization model that achieves state-of-the-art performance on two benchmarks, in a setting with limited training data and computation power. It is found that TriDet achieves close to state-of-the-art performance when only 60% of the training data, or approximately 90 action instances per class, is used. Notably, inference time, memory usage, multiply-accumulate operations, and GPU utilization all scale linearly with the length of the input tensor. These findings, combined with TriDet's mean training time of 11 minutes on the THUMOS'14 dataset, can be used to estimate the model's behavior in lower computation power environments.
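To illustrate why cost can scale linearly with input length, the sketch below counts multiply-accumulate operations (MACs) for a 1-D convolution over a feature sequence. This is a simplified stand-in, not TriDet's actual architecture; the channel and kernel sizes are hypothetical. Doubling the sequence length roughly doubles the MAC count, mirroring the linear scaling reported above.

```python
def conv1d_macs(seq_len: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """MACs for a 1-D 'valid' convolution over a sequence.

    Each of the (seq_len - kernel + 1) output positions computes
    out_ch dot products of length in_ch * kernel.
    """
    out_len = seq_len - kernel + 1
    return out_len * out_ch * in_ch * kernel

# Hypothetical layer: 512 -> 512 channels, kernel size 3.
macs_1024 = conv1d_macs(1024, 512, 512, 3)
macs_2048 = conv1d_macs(2048, 512, 512, 3)

# Doubling the sequence length scales MACs (almost exactly) linearly.
print(macs_2048 / macs_1024)
```

The ratio is slightly above 2 only because the "valid" convolution trims `kernel - 1` positions at the boundaries, a constant offset that becomes negligible for long sequences.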