Contrastive Learning of Visual Representations from Unlabeled Videos

Abstract

This thesis presents a novel self-supervised approach to learning visual representations from videos of human actions. Our approach tackles the complex problem of learning without labeled data by exploring to what extent the ideas that have proven successful for images can be transferred, adapted, and extended to videos for action recognition. We begin with a brief introduction to the topic of learning features without access to a labeled corpus, providing the motivation for our work. We then present the related research on contrastive learning, action recognition from videos with 3D convolutions, and self-supervised techniques for both images and videos. Next, we formalize our approach with regard to the sampling method, the types of spatial and temporal transformations, and the contrastive loss used. We evaluate the proposed method, videoSimCLR, in terms of linear evaluation, full fine-tuning, and video retrieval on two popular action recognition datasets, HMDB51 and UCF101. We also explore extending another contrastive learning approach to videos, videoMOCO, and compare it with videoSimCLR by means of linear evaluation.
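
For context, videoSimCLR builds on SimCLR, whose contrastive objective is commonly the NT-Xent (normalized temperature-scaled cross-entropy) loss. Below is a minimal PyTorch sketch of that standard loss applied to pairs of clip embeddings; the function name and temperature value are illustrative and not taken from the thesis.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss as used in SimCLR-style contrastive learning.
    z1, z2: [N, D] embeddings of two augmented views of the same
    N videos (e.g. two differently transformed clips per video)."""
    n = z1.size(0)
    # L2-normalize and stack both views: rows 0..N-1 are view 1,
    # rows N..2N-1 are view 2.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # [2N, D]
    sim = z @ z.t() / temperature                           # cosine similarities
    sim.fill_diagonal_(float('-inf'))                       # exclude self-pairs
    # The positive for sample i is its other view: index (i + N) mod 2N.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Each embedding is pulled toward its positive (the other view of the same video) and pushed away from the remaining 2N - 2 embeddings in the batch, which act as negatives.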