Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Conference Paper (2023)
Author(s)

Ombretta Strafforello (TNO, TU Delft - Pattern Recognition and Bioinformatics)

Xin Liu (TU Delft - Pattern Recognition and Bioinformatics)

Klamer Schutte (TNO)

Jan van Gemert (TU Delft - Pattern Recognition and Bioinformatics)

Research Group
Pattern Recognition and Bioinformatics
DOI related publication
https://doi.org/10.1109/ICCVW60793.2023.00023
Publication Year
2023
Language
English
Bibliographical Note
Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.
Pages (from-to)
159-166
Publisher
IEEE
ISBN (print)
979-8-3503-0745-0
ISBN (electronic)
979-8-3503-0744-3
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model to encode the exact sub-action order of a video, which causes a performance decrease when test videos have a different sub-action order. In this work, we investigate whether we can improve model robustness to the sub-action order by shrinking the temporal receptive field of action recognition models. For this, we design Video BagNet, a variant of the 3D ResNet-50 model with the temporal receptive field size limited to 1, 9, 17 or 33 frames. We analyze Video BagNet on synthetic and real-world video datasets and experimentally compare models with varying temporal receptive fields. We find that short receptive fields are robust to sub-action order changes, while larger temporal receptive fields are sensitive to the sub-action order.
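
To make the notion of a temporal receptive field concrete, the sketch below applies the standard receptive-field recurrence for stacked convolutions along the time axis. It is an illustration only: the function name `temporal_receptive_field` and the layer configurations are assumptions chosen to reproduce the sizes 1, 9, 17 and 33, and they are not taken from the Video BagNet architecture itself.

```python
from typing import Iterable, Tuple


def temporal_receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field, in frames, of a stack of temporal convolutions.

    Each layer is a (kernel_size, stride) pair along the time axis.
    """
    rf = 1    # before any layer, one unit sees exactly one frame
    jump = 1  # spacing, in input frames, between adjacent units at this depth
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Hypothetical configurations (an assumption, not the paper's design):
# n layers with temporal kernel 3 and stride 1. Layers with temporal
# kernel 1 are omitted because they leave the receptive field unchanged.
for n in (0, 4, 8, 16):
    print(n, temporal_receptive_field([(3, 1)] * n))
```

Under these assumptions, each layer with temporal kernel 3 and stride 1 adds two frames, so 0, 4, 8 and 16 such layers give receptive fields of 1, 9, 17 and 33 frames. Temporal striding would enlarge the field faster, since every later layer's contribution is multiplied by the accumulated stride.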

Files

Video_BagNet_short_temporal_re... (pdf)
(pdf | 1.1 Mb)
Embargo expired on 25-06-2024
License info not available