
O. Strafforello

Authored

7 records found

Video BagNet

Short temporal receptive fields increase robustness in long-term action recognition

Previous work on long-term video action recognition relies on deep 3D-convolutional models that have a large temporal receptive field (RF). We argue that these models are not always the best choice for temporal modeling in videos. A large temporal receptive field allows the model ...
Many real-world applications, from sport analysis to surveillance, benefit from automatic long-term action recognition. In the current deep learning paradigm for automatic action recognition, it is imperative that models are trained and tested on datasets and tasks that evaluate ...
Long-term activities involve humans performing complex, minutes-long actions. Unlike in traditional action recognition, complex activities are normally composed of a set of sub-actions that can appear in different order, duration, and quantity. These aspects introduce ...
The localization quality of automatic object detectors is typically evaluated by the Intersection over Union (IoU) score. In this work, we show that humans have a different view on localization quality. To evaluate this, we conduct a survey with more than 70 participants. Results ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does ...
Color is a crucial visual cue readily exploited by Convolutional Neural Networks (CNNs) for object recognition. However, CNNs struggle if there is data imbalance between color variations introduced by accidental recording conditions. Color invariance addresses this issue but does ...
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin, and where they end. Training and testing current state-of-the-art deep learning models requires access to large amounts of data and computational power. How ...

Contributed

13 records found

Benchmarking Data and Computational Efficiency of ActionFormer on Temporal Action Localization Tasks

Analysing the Performance and Generalizability of ActionFormer in Resource-constrained Environments

In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin and where they end. Training and testing current state-of-the-art deep learning models is done assuming access to large amounts of data and computational pow ...

Efficient Video Action Recognition

How well does TriDet perform and generalize in a limited compute power and data setting?

In temporal action localization, given an input video, the goal is to predict the action that is present in the video, along with its temporal boundaries. Several powerful models have been proposed throughout the years, with transformer-based models achieving state-of-the-art per ...
Bounding boxes are often used to communicate automatic object detection results to humans, aiding humans in a multitude of tasks. We investigate the relationship between bounding box localization errors and human task performance. We use observer performance studies on a visual m ...

Efficient Temporal Action Localization via Vision-Language Modelling

An Empirical Study on the STALE Model's Efficiency and Generalizability in Resource-constrained Environments

Temporal Action Localization (TAL) aims to localize the start and end times of actions in untrimmed videos and classify the corresponding action types. TAL plays an important role in understanding video. Existing TAL approaches heavily rely on deep learning and require large-scal ...

TemporalMaxer Performance in the Face of Constraint: A Study in Temporal Action Localization

A Comprehensive Analysis on the Adaptability of TemporalMaxer in Resource-Scarce Environments

This paper presents an analysis of the data and compute efficiency of the TemporalMaxer deep learning model in the context of temporal action localization (TAL), which involves accurately detecting the start and end times of specific video actions. The study explores the performa ...

Group Equivariant Video Action Recognition

Making action-recognition networks equivariant to temporal direction and discrete spatial rotations

This work applies the theory of group equivariance to the domain of video action recognition, replacing standard 3D convolutions with group convolutions that are equivariant to temporal direction and multiples of 90-degree spatial rotations. We propose a temporal direction symme ...
In the problem of video summarization, the goal is to select a subset of the input frames conveying the most important information of the input video. The collection of data proves to be a challenging task, in part because there exists disagreement among human annotators on wha ...
There is growing research on automated video summarization following the rise of video content. However, the subjectivity of the task itself is still an issue to address. This subjectivity stems from the fact that there can be different summaries for the same video depending on w ...
Instance segmentation on data from Dynamic Vision Sensors (DVS) is an important computer vision task that needs to be tackled in order to push research on this type of input forward. This paper aims to show that deep-learning-based techniques can be used to solve the task ...
The event-based camera represents a revolutionary concept with an asynchronous output. The pixels of dynamic vision sensors react to brightness changes, resulting in streams of events at very small time intervals. This paper provides a model to track objects in neuromorp ...
Event-based cameras do not capture frames like an RGB camera; they only record data from pixels that detect a change in light intensity, making them a better alternative for processing videos. The sparse data acquired from event-based video captures only movement, in an asynchronous way. In thi ...
Event-based cameras represent a new alternative to traditional frame-based sensors, with the advantages of lower output bandwidth, lower latency and higher dynamic range, thanks to their independent, asynchronous pixels. These advantages prompted the development of computer vision me ...
Video summarization is a task that many researchers have tried to automate with deep learning methods. One of these methods is the SUM-GAN-AAE algorithm developed by Apostolidis et al., an unsupervised machine learning method evaluated in this study. The research aims at ...