On Fine-grained Temporal Emotion Recognition in Video

How to Trade off Recognition Accuracy with Annotation Complexity?


Abstract

Fine-grained emotion recognition is the process of automatically identifying the emotions of users at a fine granularity level, typically in time intervals of 0.5 s to 4 s, matching the expected duration of emotions. Previous work mainly focused on developing algorithms that recognize a single emotion for a whole video, based on user feedback collected after watching it; these methods are known as post-stimuli emotion recognition. Compared to post-stimuli emotion recognition, fine-grained emotion recognition provides segment-by-segment predictions, making it possible to capture the temporal dynamics of users’ emotions while they watch videos. Its results can be aligned with the video content, revealing which specific content in the video evokes which emotions. Most previous work on fine-grained emotion recognition requires fine-grained emotion labels to train the recognition algorithm, but the experiments needed to collect such labels are usually costly and time-consuming. This thesis therefore investigates whether we can accurately predict the emotions of users at a fine granularity level with only a limited amount of emotion ground-truth labels for training.

We start our technical contribution in Chapter 3 by building baseline methods trained with fine-grained emotion labels. This helps us understand how accurate the recognition can be if we take full advantage of fine-grained labels. We propose a correlation-based emotion recognition algorithm (CorrNet) that recognizes the valence and arousal (V-A) of each instance (a fine-grained segment of signals) from physiological signals. CorrNet extracts features both inside each fine-grained signal segment (instance) and between different instances of the same video stimulus (correlation-based features). We found that, compared to sequential learning, correlation-based instance learning offers higher recognition accuracy, less overfitting, and lower computational complexity.
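As a rough illustration of the within-instance versus between-instance feature idea (this is a hypothetical sketch, not the actual CorrNet architecture; the feature choices and function names are our own), one could compute simple statistics inside each segment and a correlation of each segment against a stimulus-level template:

```python
import numpy as np

def instance_features(segment):
    # Within-instance features: simple statistics of one signal segment.
    return np.array([segment.mean(), segment.std()])

def correlation_features(instances):
    # Between-instance features: correlation of each instance with the
    # mean of all instances from the same video stimulus.
    stacked = np.stack(instances)      # (n_instances, segment_len)
    template = stacked.mean(axis=0)    # stimulus-level template
    return np.array([np.corrcoef(seg, template)[0, 1] for seg in stacked])

# Example: a physiological signal split into four fine-grained instances.
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
instances = np.split(signal, 4)
within = [instance_features(seg) for seg in instances]   # per-instance stats
between = correlation_features(instances)                # one value per instance
```

The point of the sketch is only that correlation-based features tie each instance to the shared stimulus, which is information a purely sequential model would have to learn from scratch.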

Compared to collecting fine-grained emotion labels, it is easier to collect a single emotion label after the user has watched a stimulus (i.e., a post-stimuli emotion label). Therefore, in the second technical chapter (Chapter 4) of the thesis, we investigate whether emotions can be recognized at a fine granularity level when training with only post-stimuli emotion labels, and propose an Emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL). EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotation given by the user after watching the video. Instead of fully-supervised training, the instances are weakly supervised by the post-stimuli labels during training. Our experiments show that weakly-supervised learning can reduce the overfitting caused by the temporal mismatch between fine-grained annotations and input signals.
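To make the weak-supervision setup concrete, here is a minimal multiple-instance-learning sketch (not EDMIL itself; the linear scorer and max pooling are simplifying assumptions): a video is a bag of instances, and only the single bag-level (post-stimuli) label drives the loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_prediction(instance_feats, w, b):
    # Score every fine-grained instance, then pool: the bag (video) is
    # positive if its most confident instance is positive (max pooling).
    scores = sigmoid(instance_feats @ w + b)
    return scores.max(), scores

def weak_supervision_loss(instance_feats, bag_label, w, b):
    # Only the post-stimuli (bag-level) label supervises training;
    # no per-instance label is ever used.
    bag_score, _ = bag_prediction(instance_feats, w, b)
    eps = 1e-9
    return -(bag_label * np.log(bag_score + eps)
             + (1 - bag_label) * np.log(1 - bag_score + eps))

# Example bag: 8 instances with 3 features each, one video-level label.
rng = np.random.default_rng(1)
bag = rng.normal(size=(8, 3))
w, b = rng.normal(size=3), 0.0
loss = weak_supervision_loss(bag, bag_label=1, w=w, b=b)
```

After training, the per-instance scores (discarded by the pooling at training time) are exactly what yields fine-grained, segment-by-segment predictions.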

Although the weakly-supervised learning algorithm developed in Chapter 4 can obtain accurate recognition results with only a few annotations, it can only distinguish the annotated (post-stimuli) emotion from the baseline emotion (e.g., neutral), because only post-stimuli labels are used for training; all non-annotated emotions are categorized as part of the baseline. To overcome this, in Chapter 5 we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. According to the experiments we ran in this chapter, EmoDSN achieves promising results using only 5 shots (5 samples per emotion category) of training data.
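The distance-metric idea can be sketched with a standard contrastive loss over a shared embedding (again a hypothetical illustration, not EmoDSN's actual network; the `tanh` embedding and margin value are assumptions): pairs with the same V-A label are pulled together, pairs with different labels are pushed at least a margin apart.

```python
import numpy as np

def embed(segment, W):
    # Shared embedding applied to both branches of the Siamese pair.
    return np.tanh(W @ segment)

def contrastive_loss(seg_a, seg_b, same_label, W, margin=1.0):
    # Pull segments with the same V-A label together; push segments
    # with different labels at least `margin` apart.
    d = np.linalg.norm(embed(seg_a, W) - embed(seg_b, W))
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Example pair: two 50-sample signal segments with different V-A labels.
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 50))
a, b = rng.normal(size=50), rng.normal(size=50)
loss_diff = contrastive_loss(a, b, same_label=False, W=W)
loss_same = contrastive_loss(a, a, same_label=True, W=W)
```

Because the loss is defined over pairs rather than individual samples, even 5 samples per category generate many training pairs, which is what makes the few-shot regime workable.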

Reflecting on the achievements reported in this thesis, we conclude that the fully-supervised algorithm (Chapter 3) yields the most accurate fine-grained emotion recognition when the annotation quantity is sufficient. The weakly-supervised learning method (Chapter 4) yields better recognition results at the instance level than fully-supervised methods, and performs best when users annotate either their most salient but short emotions or their overall, longer-duration (i.e., persisting) emotions. The few-shot learning method (Chapter 5) can recognize more emotion categories than the weakly-supervised method while using fewer training samples than the fully-supervised one. Its limitation, however, is that accurate recognition can only be achieved with a subject-dependent model.