Long-term behaviour recognition in videos with actor-focused region attention

Conference Paper (2021)
Author(s)

Luca Ballan (TNO, Università degli Studi di Padova)

Ombretta Strafforello (TU Delft - BUS/TNO STAFF, TU Delft - Pattern Recognition and Bioinformatics, TNO)

Klamer Schutte (TNO)

Research Group
Pattern Recognition and Bioinformatics
Copyright
© 2021 Luca Ballan, O. Strafforello, Klamer Schutte
DOI related publication
https://doi.org/10.5220/0010215803620369
Publication Year
2021
Language
English
Pages (from-to)
362-369
ISBN (electronic)
9789897584886
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Long-term activities involve humans performing complex, minutes-long actions. Unlike the actions studied in traditional action recognition, complex activities are normally composed of a set of sub-actions that can vary in order, duration, and number. These aspects introduce a large intra-class variability, which can be hard to model. Our approach aims to adaptively capture and learn the importance of spatial and temporal video regions for minutes-long activity classification. Inspired by previous work on Region Attention, our architecture embeds the spatio-temporal features from multiple video regions into a compact fixed-length representation. These features are extracted with a fine-tuned 3D convolutional backbone. Additionally, driven by the prior assumption that the most discriminative locations in the videos are centered around the human carrying out the activity, we introduce an Actor Focus mechanism to enhance feature extraction in both the training and inference phases. Our experiments show that the Multi-Regional fine-tuned 3D-CNN, topped with Actor Focus and Region Attention, substantially improves on the performance of baseline 3D architectures, achieving state-of-the-art results on Breakfast, a well-known long-term activity recognition benchmark.
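
The idea of embedding features from a variable number of video regions into one fixed-length representation can be illustrated with a generic attention-pooling sketch. This is not the authors' Region Attention module: the linear scoring function, the softmax normalisation, and the names `region_attention_pool`, `score_weights`, and `score_bias` are assumptions introduced here for illustration only, and the region features are plain lists standing in for 3D-CNN outputs.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention_pool(
    region_features: Sequence[Sequence[float]],
    score_weights: Sequence[float],
    score_bias: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """Pool R region feature vectors of size D into one vector of size D.

    Assumption: each region gets a scalar score from a linear projection;
    the scores are normalised with a softmax and used as mixing weights.
    """
    # One scalar relevance score per region (hypothetical linear scorer).
    scores = [
        sum(w * x for w, x in zip(score_weights, feat)) + score_bias
        for feat in region_features
    ]
    weights = softmax(scores)

    # Weighted sum over regions: the output length depends only on D,
    # not on how many regions were supplied.
    dim = len(score_weights)
    pooled = [
        sum(a * feat[d] for a, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return pooled, weights


# Three toy regions with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = region_attention_pool(features, score_weights=[0.0, 0.0])
```

With all-zero scoring weights every region receives the same attention, so the pooled vector reduces to the mean of the region features. In a trained model the scoring parameters would be learned, letting more discriminative regions contribute more to the final representation.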