In temporal action localization, given an input video, the goal is to predict the action that is present in the video, along with its temporal boundaries. Several powerful models have been proposed over the years, with transformer-based models achieving state-of-the-art performance in recent months. Although novel models are becoming increasingly accurate, authors rarely study how limited training data or limited computational power affects the performance of their models. This study is carried out on TriDet, a transformer-based temporal action localization model that achieves state-of-the-art performance on two different benchmarks. It evaluates the model’s behavior when training data and computational power are limited. TriDet is found to achieve close to state-of-the-art performance when only 60% of the training data, or approximately 90 action instances per class, is used. Notably, inference time, memory usage, multiply-accumulate operations and GPU utilization scale linearly with the length of the tensor that is passed to the model. These findings, combined with TriDet’s mean training time of 11 minutes on the THUMOS’14 dataset, can be used to estimate the model’s behavior in environments with less computational power.
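The linear scaling of multiply-accumulate operations with input length can be illustrated with a single layer. The following is a minimal sketch, not TriDet's actual architecture: it counts the MACs of one 1D convolution, and the function name `conv1d_macs` and the layer sizes are hypothetical values chosen only for illustration.

```python
def conv1d_macs(length: int, in_channels: int, out_channels: int,
                kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Count multiply-accumulate operations of one 1D convolution (no bias)."""
    # Number of output positions along the temporal axis.
    out_length = (length + 2 * padding - kernel_size) // stride + 1
    if out_length <= 0:
        return 0
    # Each output element needs in_channels * kernel_size MACs.
    return out_length * out_channels * in_channels * kernel_size


if __name__ == "__main__":
    # Hypothetical layer sizes; padding=1 with kernel_size=3 keeps the
    # output length equal to the input length.
    for length in (256, 512, 1024, 2048):
        macs = conv1d_macs(length, in_channels=512, out_channels=512,
                           kernel_size=3, padding=1)
        print(length, macs)
```

With length-preserving padding, doubling the input length doubles the count, which is the kind of behavior the measurements above describe for the full model.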
Temporal Action Localization (TAL) aims to localize the start and end times of actions in untrimmed videos and classify the corresponding action types. TAL plays an important role in understanding video. Existing TAL approaches rely heavily on deep learning and require large-scale data and expensive training processes. Recent advances in Contrastive Language-Image Pre-Training (CLIP) have brought vision-language modeling into the field of TAL. While current CLIP-based TAL methods have proven effective, their capabilities in data-limited and compute-limited settings have not been explored. In this paper, we investigate the data and compute efficiency of the CLIP-based STALE model. We evaluate the model's performance in data-limited open-set and closed-set scenarios. We find that STALE demonstrates adequate generalizability with limited data. We measure the training time, inference time, GPU utilization, MACs, and memory consumption of STALE for inputs of varying video lengths. We identify an optimal input length for STALE inference. Using model quantization, we find a significant reduction in forward time for STALE on a single CPU. Our findings shed light on the capabilities and limitations of CLIP-based TAL methods under constrained data and compute resources. The insights gained from this research contribute to enhancing the efficiency and applicability of CLIP-based TAL techniques in real-world scenarios. The results provide valuable guidance for future advancements in CLIP-based TAL models and their potential for broader adoption in resource-constrained environments.
This paper presents an analysis of the data and compute efficiency of the TemporalMaxer deep learning model in the context of temporal action localization (TAL), which involves accurately detecting the start and end times of specific video actions. The study explores the performance and scalability of the TemporalMaxer model under limited resources and data availability, focusing on factors such as hardware requirements, training time, and data utilization, thus contributing to the advancement of efficient deep learning models for real-world video tasks. Through a literature review of temporal action recognition models, an evaluation of learning curves for data efficiency, and the development of metrics to assess compute efficiency, the study provides insights into the performance trade-offs of the TemporalMaxer model. Experiments conducted on the widely used THUMOS dataset further demonstrate the model's generalizability with limited data, reaching substantial accuracy with only 50% of the training data. Notably, TemporalMaxer exhibits superior compute efficiency, requiring significantly fewer Multiply-Accumulate operations (MACs) than other state-of-the-art models. However, alternative models like TriDet and TadTR outperform TemporalMaxer in scenarios where training time is constrained. These findings shed light on the model's practical applicability in resource-constrained environments, offering insights for further optimization and study.
In temporal action localization, given an input video, the goal is to predict which actions it contains, where they begin and where they end. Current state-of-the-art deep learning models are trained and tested under the assumption of access to large amounts of data and computational power. Gathering such data is, however, a challenging task, and access to computational resources might be limited. This work thus explores and measures how well one such deep learning model, ActionFormer, performs in settings constrained by the amount of data or computational power. Data efficiency was measured by training the model on a subset of the training set and testing on the test set. Although ActionFormer showed promising results on both the THUMOS'14 and ActivityNet datasets, the TriDet and TemporalMaxer models should likely be preferred over ActionFormer in limited data settings, as they exhibit better data efficiency. Similarly, the TriDet model should be preferred over ActionFormer in cases where the training time is limited, as it showed better computational efficiency during training. To test the efficiency of the model during inference, videos of different lengths were passed through the model. Most importantly, we find that both the inference time and the memory usage of the model scale linearly with input video length, as predicted by the authors of ActionFormer.
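The data-efficiency protocol described above (train on a subset, test on the full test set) needs a way to draw the subset. The abstract does not say how the subsets were sampled, so the following is only a sketch under assumptions: one class label per training item, per-class (stratified) sampling, and a fixed seed for repeatability. The function name `stratified_subset` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices of a per-class random subset of the training set."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Group training indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        # Keep at least one example per class so no class disappears.
        keep = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, keep))
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical toy label list, not taken from any dataset.
    toy_labels = ["run"] * 10 + ["jump"] * 5
    print(stratified_subset(toy_labels, 0.6, seed=1))
```

Sampling per class keeps the class balance of the subset close to that of the full training set, so accuracy differences between subset sizes are less likely to be caused by a class going missing.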
Bounding boxes are often used to communicate automatic object detection results to humans, aiding humans in a multitude of tasks. We investigate the relationship between bounding box localization errors and human task performance. We use observer performance studies on a visual multi-object counting task to measure both human trust and performance with different levels of bounding box accuracy. The results show that localization errors have no significant impact on human accuracy or trust in the system. Recall and precision errors impact both human performance and trust, suggesting that optimizing algorithms based on the F1 score is more beneficial in human-computer tasks. Lastly, the paper offers an improvement on bounding boxes in multi-object counting tasks with center dots, showing improved performance and better resilience to localization inaccuracy.
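For reference, the F1 score mentioned above is the harmonic mean of precision and recall. A minimal sketch of computing it from detection counts follows; the function name and the example counts are hypothetical and not taken from the study.

```python
def f1_score(true_positives: int, false_positives: int,
             false_negatives: int) -> float:
    """Harmonic mean of precision and recall from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # With no correct detections, the score is defined here as 0.
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical detector: 6 correct boxes, 2 spurious boxes, 6 missed objects.
    print(f1_score(true_positives=6, false_positives=2, false_negatives=6))
```

Because the score depends only on which objects are found or missed, it is insensitive to how tightly each box is drawn, which matches the abstract's observation that localization error mattered less than recall and precision errors.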
Event-based cameras represent a new alternative to traditional frame-based sensors, with advantages in lower output bandwidth, lower latency and higher dynamic range, thanks to their independent, asynchronous pixels. These advantages have prompted the development of computer vision methods on event data in the last decade; however, event-based datasets are still at an early stage in terms of size and complexity compared to conventional datasets (e.g. ImageNet). This paper explores event data augmentation by superimposing two existing event datasets (N-MNIST and N-Caltech101) and by adding uniform noise. It shows that training an instance segmentation model on noisy datasets does not improve its performance, and that the amount and type of noise added in the background decrease the performance of such a model.
Instance segmentation on data from Dynamic Vision Sensors (DVS) is an important computer vision task that needs to be tackled in order to push forward research on these types of inputs. This paper aims to show that deep-learning-based techniques can be used to solve the task of instance segmentation on DVS data. A high-performing model was used to solve this task, using event-based data that was transformed into RGB-D images. The model chosen for this work was Mask R-CNN, with an alteration for depth images, because of its high performance on frame-based data. The N-MNIST dataset provides the event-based input, and the transformation of this input is presented in this study. Furthermore, the masks are generated with the help of the MNIST dataset, and heuristics are used to place them at the correct positions. The results are promising and comparable to other results from the literature on the task of semantic segmentation.
The event-based camera represents a revolutionary concept, having an asynchronous output. The pixels of dynamic vision sensors react to changes in brightness, resulting in streams of events at very small time intervals. This paper provides a model to track objects in neuromorphic datasets using clustering. In addition, a non-linear filter is applied to correct the estimate of the object position. Both single and multi-object tracking algorithms are provided, and their performance is analyzed using different metrics, including clustering evaluation scores and tracking accuracy. The accuracy is over 0.6 for multi-target tracking and more than 0.7 for single-object tracking. Besides the proposed model, a comparison between different possible approaches for event-based data tracking is provided.
Event-based cameras do not capture frames like an RGB camera; they record data only from pixels that detect a change in light intensity, making them a better alternative for processing videos. The sparse data acquired from event-based video captures only movement, and does so asynchronously. This paper evaluates the efficiency and accuracy of object detection, specifically localization, on sparse versus dense representations of data. Convolutional Neural Networks are trained and tested on images and event-based data. The results show a positive trade-off in terms of accuracy and efficiency for using sparse event-based data instead of dense data like images. These results provide a basis for an argument to use event-based cameras instead of RGB cameras when dealing with object detection.
In the problem of video summarization, the goal is to select a subset of the input frames conveying the most important information of the input video. The collection of data proves to be a challenging task, in part because human annotators disagree on which segments of a video should be considered important for a summary. In this study we analyse a new dataset created with the goal of increasing agreement between the human annotators. The dataset was created using a novel annotation method, which uses existing action localization labels for segmenting the videos. We train a supervised and an unsupervised deep learning framework on commonly used benchmark datasets and the new dataset. Experimental results show the effectiveness of this novel summary annotation method in improving the agreement between annotators. Analysis reveals some issues with the evaluation of the deep learning framework.
There is growing research on automated video summarization following the rise of video content. However, the subjectivity of the task itself is still an issue to address. This subjectivity stems from the fact that there can be different summaries for the same video depending on which parts one considers important. Supervised models especially suffer from this problem as they need informative labels to learn from. As a result, upon evaluation, supervised models appear to perform worse than unsupervised models. This inspired our research on whether action localization can aid the video summarization process. To investigate this issue, this paper will answer the question of how well VASNet, a supervised video summarization model, can predict summaries for videos in an action localization dataset. This involves investigating whether action localization can produce well-correlated human-generated summaries and how it affects the quality of predicted summaries. Our findings reveal that there is a positive indication that action localization can aid in producing more well-correlated human summaries. In addition, we have observed that upon comparison with several video summarization models, VASNet has performed well and that in general, supervised models appear to outperform unsupervised ones when trained with an action localization dataset.
Video summarization is a task which many researchers have tried to automate with deep learning methods. One of these methods is the SUM-GAN-AAE algorithm developed by Apostolidis et al., an unsupervised machine learning method that is evaluated in this study. The research aims at testing the algorithm's performance on the Breakfast dataset, which is an action localization dataset, and evaluating it with rank correlation coefficients. Parameter optimization was performed to tune the learning rate of the system for the Breakfast dataset. Then, using k-fold cross-validation, three metrics were used to evaluate the trained model: F-Score, Kendall's τ and Spearman's ρ. Analysis of the results indicates a high F-Score, as reported by the SUM-GAN-AAE paper, but low rank correlation coefficients. Moreover, plotting importance scores per frame demonstrates the algorithm's inability to select key frames. The findings suggest that F-Score is not a fitting metric to use in the context of video summarization and that the SUM-GAN-AAE algorithm performs poorly not only on action localization datasets but also on video summarization ones such as SumMe.
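The two rank correlation coefficients named above compare the ordering of predicted frame importance scores with the ordering of human scores. The sketch below is a plain-Python illustration, not the study's evaluation code: it assumes the tau-a variant of Kendall's τ (no tie correction) and computes Spearman's ρ as the Pearson correlation of average ranks. All function names and the example scores are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1      # pair ordered the same way in both sequences
        elif product < 0:
            score -= 1      # pair ordered in opposite ways
    n = len(x)
    return score / (n * (n - 1) / 2)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-frame importance scores, for illustration only.
    human = [0.1, 0.4, 0.35, 0.8, 0.6]
    predicted = [0.5, 0.5, 0.5, 0.6, 0.4]
    print(kendall_tau_a(human, predicted), spearman_rho(human, predicted))
```

Unlike an overlap-based F-Score, both coefficients drop toward zero when the predicted scores are nearly flat or unrelated to the human ranking, which is why they can expose a model that fails to single out key frames.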
This work applies the theory of group equivariance to the domain of video action recognition, replacing standard 3D convolutions with group convolutions which are equivariant to temporal direction and to spatial rotations by multiples of 90 degrees. We propose a temporal direction symmetry group T2 and extend the standard planar rotation group to three dimensions to form a 3D group that is equivariant to discrete 90-degree spatial rotations. We analyse the efficacy of using these 3D-G-CNNs as drop-in replacements in 3D networks by evaluating them on synthesized datasets containing handwritten MNIST digits moving over a black background, as well as on the popular action recognition datasets UCF-101 and HMDB-51, and comparing the results against the performance of standard 3D CNNs on the same datasets.
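To make the notion of equivariance to temporal direction concrete, here is a toy 1D sketch. It is not the 3D group convolution from this work: it assumes a two-element group (identity and time reversal) acting on a 1D signal, and all function names, the signal, and the filter are hypothetical. The lifting layer applies the filter and its time-reversed copy; reversing the input then reverses each output in time and swaps the two channels.

```python
def correlate(signal, kernel):
    """Valid 1D cross-correlation: out[t] = sum_k signal[t + k] * kernel[k]."""
    width = len(kernel)
    return [sum(signal[t + k] * kernel[k] for k in range(width))
            for t in range(len(signal) - width + 1)]


def lift_t2(signal, kernel):
    """Lifting layer for the two-element group {identity, time reversal}.

    Returns one response per group element: the kernel as given and the
    kernel reversed in time.
    """
    return [correlate(signal, kernel), correlate(signal, kernel[::-1])]


def act_on_output(lifted):
    """Action of time reversal on lifted features: flip time, swap channels."""
    forward, backward = lifted
    return [backward[::-1], forward[::-1]]


if __name__ == "__main__":
    signal = [0, 1, 3, 2, 5, 4, 1]   # toy intensity over 7 time steps
    kernel = [1, -2, 4]              # hypothetical temporal filter
    # Equivariance: transforming the input first gives the same result as
    # transforming the output afterwards.
    reversed_first = lift_t2(signal[::-1], kernel)
    reversed_after = act_on_output(lift_t2(signal, kernel))
    print(reversed_first == reversed_after)
```

The same idea extends to 90-degree spatial rotations by adding rotated copies of the filter, at the cost of one output channel per group element.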