
O. Strafforello

Authored

7 records found

Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model ...
Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate ...
Long-term activities involve humans performing complex, minutes-long actions. Unlike in traditional action recognition, complex activities are normally composed of a set of sub-actions that can appear in different order, duration, and quantity. These aspects introduce ...
The localization quality of automatic object detectors is typically evaluated by the Intersection over Union (IoU) score. In this work, we show that humans have a different view on localization quality. To evaluate this, we conduct a survey with more than 70 participants. Results ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does ...
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. How ...

Contributed

13 records found

Benchmarking Data and Computational Efficiency of ActionFormer on Temporal Action Localization Tasks

Analysing the Performance and Generalizability of ActionFormer in Resource-constrained Environments

In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin and where they end. Training and testing current state-of-the-art deep learning models is done assuming access to large amounts of data and computational pow ...

Efficient Video Action Recognition

How well does TriDet perform and generalize in a limited compute power and data setting?

In temporal action localization, given an input video, the goal is to predict the action that is present in the video, along with its temporal boundaries. Several powerful models have been proposed throughout the years, with transformer-based models achieving state-of-the-art per ...
Bounding boxes are often used to communicate automatic object detection results to humans, aiding humans in a multitude of tasks. We investigate the relationship between bounding box localization errors and human task performance. We use observer performance studies on a visual m ...

Efficient Temporal Action Localization via Vision-Language Modelling

An Empirical Study on the STALE Model's Efficiency and Generalizability in Resource-constrained Environments

Temporal Action Localization (TAL) aims to localize the start and end times of actions in untrimmed videos and classify the corresponding action types. TAL plays an important role in understanding video. Existing TAL approaches heavily rely on deep learning and require large-scal ...

TemporalMaxer Performance in the Face of Constraint: A Study in Temporal Action Localization

A Comprehensive Analysis on the Adaptability of TemporalMaxer in Resource-Scarce Environments

This paper presents an analysis of the data and compute efficiency of the TemporalMaxer deep learning model in the context of temporal action localization (TAL), which involves accurately detecting the start and end times of specific video actions. The study explores the performa ...

Group Equivariant Video Action Recognition

Making action-recognition networks equivariant to temporal direction and discrete spatial rotations

This work applies the theory of group equivariance to the domain of video action recognition, replacing standard 3D convolutions with group convolutions that are equivariant to temporal direction and multiples of 90-degree spatial rotations. We propose a temporal direction symme ...
In the problem of video summarization, the goal is to select a subset of the input frames conveying the most important information of the input video. The collection of data proves to be a challenging task, in part because there exists disagreement among human annotators on wha ...
There is growing research on automated video summarization following the rise of video content. However, the subjectivity of the task itself is still an issue to address. This subjectivity stems from the fact that there can be different summaries for the same video depending on w ...
Instance segmentation on data from Dynamic Vision Sensors (DVS) is an important computer vision task that needs to be tackled in order to push research on this type of input forward. This paper aims to show that deep-learning-based techniques can be used to solve the task ...
The event-based camera represents a revolutionary concept with an asynchronous output. The pixels of dynamic vision sensors react to brightness changes, resulting in streams of events at very small time intervals. This paper provides a model to track objects in neuromorp ...
Event-based cameras do not capture frames like an RGB camera; they only record data from pixels that detect a change in light intensity, making them a better alternative for processing videos. The sparse data acquired from event-based video captures only movement, in an asynchronous way. In thi ...
Event-based cameras represent a new alternative to traditional frame-based sensors, with the advantages of lower output bandwidth, lower latency and higher dynamic range, thanks to their independent, asynchronous pixels. These advantages prompted the development of computer vision me ...
Video summarization is a task that many researchers have tried to automate with deep learning methods. One of these methods is the SUM-GAN-AAE algorithm developed by Apostolidis et al., an unsupervised machine learning method evaluated in this study. The research aims at ...