Segmenting actions by aligning video frames to learned prototypes
D. Hoonhout (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Silvia Pintea – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Jan C. van Gemert – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
Video temporal action localization is the task of identifying and localizing specific actions or activities within a video stream. Instead of only classifying which actions occur in a video, we aim to detect when each action begins and ends. In this work, we focus on solving this task without any supervision. Existing unsupervised methods address the task by exploiting a combination of spatial and temporal information. We propose a new model that uses an MLP (multilayer perceptron) to learn to sample prototype frames from a video. We use the distance between the prototypes and the video frames, given by DTW (dynamic time warping), as a loss function to update the MLP. Combined with DTW, the sampled prototypes allow us to find the start and end boundaries of actions. Additionally, the prototype frames can be used for video summarization. We analyze our model in a controlled synthetic-data setup to show its strengths and weaknesses. Finally, we use the Breakfast dataset and the Cholec80 surgery dataset to compare our model to state-of-the-art models in a realistic scenario.
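To make the pipeline concrete, the sketch below illustrates the prototype-sampling and DTW-alignment idea described in the abstract. It assumes PyTorch and pre-extracted per-frame features; all names (ProtoSampler, soft_dtw, segment) and hyper-parameters (K = 5 prototypes, gamma, layer sizes) are illustrative choices, not the thesis's actual implementation. Because hard frame selection is not differentiable, the sketch soft-selects prototypes with a softmax over time and uses a soft-DTW recursion as the training loss; at inference, a hard DTW backtrace assigns each frame to a prototype, and action boundaries fall where the assignment changes.

```python
import torch
import torch.nn as nn

class ProtoSampler(nn.Module):
    """MLP that scores every frame for each of K prototype slots;
    prototypes are soft-selected so sampling stays differentiable."""
    def __init__(self, feat_dim, k):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, k))

    def forward(self, frames):               # frames: (T, D)
        scores = self.mlp(frames)            # (T, K)
        weights = scores.softmax(dim=0)      # soft selection over time
        return weights.t() @ frames          # (K, D) prototype features

def soft_dtw(protos, frames, gamma=0.1):
    """Differentiable DTW distance between the prototype sequence
    (K, D) and the frame sequence (T, D), used as the training loss."""
    K, T = protos.size(0), frames.size(0)
    cost = torch.cdist(protos, frames) ** 2  # (K, T) pairwise costs
    inf = torch.tensor(float("inf"))
    R = [[inf] * (T + 1) for _ in range(K + 1)]
    R[0][0] = torch.tensor(0.0)
    for i in range(1, K + 1):
        for j in range(1, T + 1):            # soft-min over the three DTW moves
            prev = torch.stack([R[i-1][j-1], R[i-1][j], R[i][j-1]])
            R[i][j] = cost[i-1, j-1] - gamma * torch.logsumexp(-prev / gamma, 0)
    return R[K][T]

@torch.no_grad()
def segment(protos, frames):
    """Hard DTW backtrace: assign each frame to a prototype and place
    an action boundary wherever the assignment changes."""
    K, T = protos.size(0), frames.size(0)
    cost = torch.cdist(protos, frames) ** 2
    D = torch.full((K + 1, T + 1), float("inf"))
    D[0, 0] = 0.0
    for i in range(1, K + 1):
        for j in range(1, T + 1):
            D[i, j] = cost[i-1, j-1] + torch.min(
                torch.stack([D[i-1, j-1], D[i-1, j], D[i, j-1]]))
    i, j, assign = K, T, [0] * T
    while i > 0 and j > 0:
        assign[j-1] = i - 1                  # frame j-1 aligns to prototype i-1
        step = torch.argmin(torch.stack([D[i-1, j-1], D[i-1, j], D[i, j-1]]))
        if step == 0:   i, j = i - 1, j - 1
        elif step == 1: i -= 1
        else:           j -= 1
    return [t for t in range(1, T) if assign[t] != assign[t-1]]

# Usage: random features stand in for real per-frame video features
# that would come from a pretrained backbone.
frames = torch.randn(200, 64)                # T=200 frames, D=64 dims
model = ProtoSampler(feat_dim=64, k=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(50):
    loss = soft_dtw(model(frames), frames)
    opt.zero_grad(); loss.backward(); opt.step()
print(segment(model(frames), frames))        # frame indices of boundaries
```

The quadratic dynamic-programming loops are kept deliberately naive for readability; a practical implementation would batch and vectorize them. The monotonicity of the DTW path is what makes the boundary rule work: each prototype covers one contiguous run of frames, so a change of assignment marks the transition between consecutive actions.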