Long-term behaviour recognition in videos with actor-focused region attention

Conference Paper (2021)

Author(s)

Luca Ballan (TNO, Università degli Studi di Padova)

Ombretta Strafforello (TU Delft - BUS/TNO STAFF, TU Delft - Electrical Engineering, Mathematics and Computer Science, TNO)

Klamer Schutte (TNO)

Research Group

Pattern Recognition and Bioinformatics

Keywords

3D convolutional neural networks, Action recognition, Region attention, Video classification

DOI related publication

https://doi.org/10.5220/0010215803620369 (final published version)

To reference this document use

https://resolver.tudelft.nl/uuid:bdc14b3b-f4e5-410c-a7df-4565e5bbdc33

Publication Year

2021

Language

English

Pages (from-to)

362-369

ISBN (electronic)

9789897584886

Event

16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2021 (2021-02-08 - 2021-02-10), Virtual, Online

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Long-term activities involve humans performing complex, minutes-long actions. Unlike the actions studied in traditional action recognition, complex activities are normally composed of a set of sub-actions that can appear in different orders, durations, and quantities. These aspects introduce a large intra-class variability, which can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a specifically fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in the videos are centered around the human who is carrying out the activity, we introduce an Actor Focus mechanism to enhance feature extraction in both the training and inference phases. Our experiments show that the Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, substantially improves on the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
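
The idea of embedding features from a variable number of video regions into one fixed-length representation can be illustrated with generic soft attention pooling. The sketch below is not the authors' architecture: the function names, the dot-product scoring rule, and the `query` vector (standing in for a learned parameter) are all assumptions made for illustration, and the region features are plain lists rather than outputs of a 3D-CNN.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention_pool(
    region_features: Sequence[Sequence[float]],
    query: Sequence[float],
) -> List[float]:
    """Pool N region feature vectors (each of length D) into one D-vector.

    Hypothetical illustration: each region is scored by its dot product
    with `query`, the scores are normalised with softmax, and the output
    is the attention-weighted sum of the region features.
    """
    # One relevance score per region (assumed dot-product scoring).
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(query)
    # Weighted sum over regions, computed one feature dimension at a time.
    return [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
```

The output length depends only on the feature dimension, not on how many regions are supplied, which is what makes the representation fixed-length. With two regions that receive equal scores, for instance `region_attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])`, the weights are uniform and the result is the plain average of the two feature vectors.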

Files

102158.pdf

(pdf | 2.18 MB)