Multi-frame deep learning models for action detection in surveillance videos

Master Thesis (2019)

Author(s)

T.A.K. Khan (TU Delft - Mechanical Engineering)

Contributor(s)

R Van de Plas – Mentor (TU Delft - Team Raf Van de Plas)

Gertjan Burghouts – Mentor (TNO)

Raimon Pruim – Mentor (TNO)

J. Kober – Graduation committee member (TU Delft - Learning & Autonomous Control)

O.A. Soloviev – Graduation committee member (TU Delft - Team Raf Van de Plas)

Faculty

Mechanical Engineering

Copyright

Deep learning, Surveillance, Video processing, Action detection

To reference this document use:

https://resolver.tudelft.nl/uuid:ca531228-239d-4c22-a1dc-94cd23991cdd

More Info

Publication Year

2019

Language

English

Graduation Date

26-11-2019

Awarding Institution

Delft University of Technology

Programme

Mechanical Engineering | Systems and Control

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Visual surveillance technologies are increasingly being used to monitor public spaces. These technologies process the recordings of surveillance cameras. Such recordings contain depictions of human actions such as "running", "waving", and "aggression". In the field of computer vision, automated detection of human actions in videos is known as action detection. Recently, deep learning models have been proposed for the task of action detection. Deep learning models for this task can be grouped into single-frame models and multi-frame models. Single-frame models detect actions using individual frames of videos whereas multi-frame models detect actions using sequences of frames.
This thesis proposes using multi-frame models rather than single-frame models for action detection in surveillance videos. To compare the two, we implement the ACT-detector, a deep learning model that takes a sequence of K frames as input and outputs tubelets (labeled sequences of bounding boxes). We train and evaluate the ACT-detector for various values of K on the VIRAT dataset. In our comparison, K=1 serves as the single-frame model and K>1 as the multi-frame models. Qualitatively, we find that multi-frame models have fewer missed detections. Quantitatively, we find that multi-frame models outperform single-frame models on performance measures such as classification accuracy, MABO, frame-mAP, and video-mAP.
To assess whether the improvements of multi-frame models stem purely from the increased number of frames, or also from the temporal order encoded by those frames, we train multi-frame models on unordered sequences of frames, i.e., sequences whose frames are shuffled in time. Qualitatively, we find that multi-frame models localize less precisely when trained on unordered sequences. Quantitatively, we find that multi-frame models perform worse when trained on unordered sequences, indicating that multi-frame models learn the temporal dynamics of actions. Nevertheless, even when trained on unordered sequences, multi-frame models outperform single-frame models for action detection in surveillance videos.
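
The input setup described in the abstract can be illustrated with a minimal sketch. It is not taken from the thesis: frames are stood in for by integer indices, and the names frame_windows and shuffle_in_time are hypothetical helpers introduced only for this illustration. It shows how sequences of K consecutive frames might be formed, how K=1 reduces to the single-frame case, and how an unordered variant of each sequence could be produced by shuffling its frames in time.

```python
import random


def frame_windows(num_frames, k):
    """Return every run of k consecutive frame indices in a video.

    Hypothetical helper: the thesis does not specify how sequences are sampled.
    """
    return [list(range(start, start + k)) for start in range(num_frames - k + 1)]


def shuffle_in_time(window, rng):
    """Return a copy of the window with its frames in random temporal order.

    The same frames are kept; only their order changes. A shuffle can, by
    chance, reproduce the original order.
    """
    shuffled = list(window)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed, assumed here only for repeatability

# Multi-frame case: sequences of K=4 consecutive frames from an 8-frame video.
ordered = frame_windows(num_frames=8, k=4)

# Unordered variant: same frames per sequence, temporal order destroyed.
unordered = [shuffle_in_time(window, rng) for window in ordered]

# Single-frame case: K=1 yields one sequence per frame.
single = frame_windows(num_frames=8, k=1)
```

Under these assumptions, each unordered sequence contains exactly the same frames as its ordered counterpart, so any performance gap between the two training regimes would be attributable to temporal order rather than to the number of frames.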

Files

MSc_Thesis_Tiamur_Khan_4247329... (pdf)

(pdf | 15.3 MB)

License info not available