Visual surveillance technologies are increasingly being used to monitor public spaces. These technologies process the recordings of surveillance cameras. Such recordings contain depictions of human actions such as "running", "waving", and "aggression". In the field of computer vi
...
Visual surveillance technologies are increasingly being used to monitor public spaces. These technologies process the recordings of surveillance cameras. Such recordings contain depictions of human actions such as "running", "waving", and "aggression". In the field of computer vision, automated detection of human actions in videos is known as action detection. Recently, deep learning models have been proposed for the task of action detection. Deep learning models for this task can be grouped into single-frame models and multi-frame models. Single-frame models detect actions using individual frames of videos whereas multi-frame models detect actions using sequences of frames.
This thesis proposes to use multi-frame models as compared to single-frame models for action detection in surveillance videos. To compare multi-frame and single-frame models, we implement the ACT-detector. The ACT-detector is a deep learning model that takes as input a sequence of K frames and outputs tubelets (labeled sequences of bounding boxes). We train and evaluate ACT for various values of K on the VIRAT dataset. In our comparison, K=1 serves as the single-frame model and K>1 as the multi-frame models. When compared qualitatively, we find that multi-frame models have less missed detections. When compared quantitatively, we find that multi-frame models outperform single-frame models in performance measures such as classification accuracy, MABO, frame-mAP, and video-mAP.
To assess whether the improvements of multi-frame models yield purely from the increased number of frames, or also from the temporal order encoded by those frames, we experiment with training multi-frame models on unordered sequences of frames, i.e., sequences for which the frames are shuffled in time. When compared qualitatively, we find that multi-frame models have less precise localization when trained on unordered sequences. When compared quantitatively, we find that multi-frame models perform worse when trained on unordered sequences, indicating that multi-frame models learn temporal dynamics of actions. Nevertheless, even when trained on unordered sequences, multi-frame models outperform single-frame models for action detection in surveillance videos.