Efficient Temporal Action Localization via Vision-Language Modelling

An Empirical Study on the STALE Model's Efficiency and Generalizability in Resource-constrained Environments

Abstract

Temporal Action Localization (TAL) aims to localize the start and end times of actions in untrimmed videos and to classify the corresponding action types, and it plays an important role in video understanding. Existing TAL approaches rely heavily on deep learning and require large-scale data and expensive training. Recent advances in Contrastive Language-Image Pre-Training (CLIP) have brought vision-language modeling into TAL. While current CLIP-based TAL methods have proven effective, their capabilities in data- and compute-limited settings remain unexplored. In this paper, we investigate the data and compute efficiency of the CLIP-based STALE model. We evaluate model performance under data-limited open-set and closed-set scenarios and find that STALE generalizes adequately with limited data. We measure STALE's training time, inference time, GPU utilization, MACs, and memory consumption for inputs of varying video lengths, and we identify an optimal input length for inference. Using model quantization, we obtain a significant reduction in forward-pass time for STALE on a single CPU. Our findings shed light on the capabilities and limitations of CLIP-based TAL methods under constrained data and compute resources. These insights contribute to improving the efficiency and applicability of CLIP-based TAL techniques in real-world scenarios and provide guidance for future advances in CLIP-based TAL models and their adoption in resource-constrained environments.
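
As a concrete illustration of the quantization setup mentioned above, the minimal sketch below shows how post-training dynamic quantization and a CPU forward-pass timing comparison could be arranged in PyTorch. The small stand-in module, feature dimensions, and batch size are hypothetical placeholders for illustration only, not the actual STALE architecture or the experimental configuration used in the paper.

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in for a CLIP-feature-based localization head;
# the real STALE model is not reproduced here.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 256),
)
model.eval()

# Dynamic quantization converts Linear weights to int8 and quantizes
# activations on the fly at inference time (CPU execution only).
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Compare single-CPU forward-pass time on a dummy batch of frame features.
x = torch.randn(100, 512)
for name, m in [("fp32", model), ("int8", quantized_model)]:
    start = time.perf_counter()
    with torch.no_grad():
        m(x)
    print(f"{name} forward: {(time.perf_counter() - start) * 1e3:.2f} ms")
```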