T. Zhang

7 records found

Journal article (2023) - Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, Pablo Cesar
Instead of predicting just one emotion for one activity (e.g., video watching), fine-grained emotion recognition enables more temporally precise recognition. Previous works on fine-grained emotion recognition require segment-by-segment, fine-grained emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with only collecting one emotion label after the user watched that stimulus (i.e., the post-stimuli emotion labels). To recognize emotions at a finer granularity level when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of fully-supervised training, the instances are weakly-supervised by the post-stimuli labels in the training stage. The V-A of instances are estimated by the instance gains, which indicate the probability of instances to predict the post-stimuli labels. We tested EDMIL on three different datasets, CASE, MERCA and CEAP-360VR, collected in three different environments: desktop, mobile and HMD-based Virtual Reality, respectively. Recognition results validated with the fine-grained V-A self-reports show that for subject-independent 3-class classification (high/neutral/low), EDMIL obtains promising recognition accuracies: 75.63% and 79.73% for V-A on CASE, 70.51% and 67.62% for V-A on MERCA and 65.04% and 67.05% for V-A on CEAP-360VR. Our ablation study shows that all components of EDMIL contribute to both the classification and regression tasks. 
Our experiments also show that (1) compared with fully-supervised learning, weakly-supervised learning can reduce the problem of overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals, (2) instance segment lengths between 1-2 s result in the highest recognition accuracies and (3) EDMIL performs best if post-stimuli annotations consist of less than 30% or more than 60% of the entire video watching. ...
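The instance and bag terminology in the abstract above can be illustrated with a minimal sketch. It assumes a fixed sampling rate, non-overlapping windows, and max pooling as the bag-level aggregation. The names `segment_instances` and `bag_score` are hypothetical, and this shows a generic multiple-instance setup, not the EDMIL architecture itself.

```python
from typing import List, Sequence


def segment_instances(signal: Sequence[float], sample_rate_hz: int,
                      window_s: float) -> List[List[float]]:
    """Split a 1-D signal into non-overlapping, fixed-length instances.

    Trailing samples that do not fill a whole window are dropped.
    """
    window = int(sample_rate_hz * window_s)
    if window <= 0:
        raise ValueError("window must cover at least one sample")
    n_full = len(signal) // window
    return [list(signal[i * window:(i + 1) * window]) for i in range(n_full)]


def bag_score(instance_scores: Sequence[float]) -> float:
    """Aggregate per-instance scores into one bag-level score (max pooling).

    Under the standard MIL assumption, a bag is positive if at least one
    of its instances is positive, so the maximum is a common aggregation.
    """
    if not instance_scores:
        raise ValueError("a bag needs at least one instance")
    return max(instance_scores)


# Toy example: 10 s of a signal sampled at 4 Hz, cut into 2 s instances.
signal = [float(i) for i in range(40)]
instances = segment_instances(signal, sample_rate_hz=4, window_s=2.0)

# Placeholder instance scores (assumption: a trained model would produce these).
scores = [sum(inst) / len(inst) / 40.0 for inst in instances]
video_level = bag_score(scores)
```

In a weakly-supervised setting, only `video_level` would be compared against the post-stimuli label during training, while the per-instance `scores` serve as the fine-grained estimates.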

How to Trade off Recognition Accuracy with Annotation Complexity?

Doctoral thesis (2022) - Tianyi Zhang
Fine-grained emotion recognition is the process of automatically identifying the emotions of users at a fine granularity level, typically in the time intervals of 0.5s to 4s according to the expected duration of emotions. Previous work mainly focused on developing algorithms to recognize only one emotion for a video based on the user feedback after watching the video. These methods are known as post-stimuli emotion recognition. Compared to post-stimuli emotion recognition, fine-grained emotion recognition can provide segment-by-segment prediction results, making it possible to capture the temporal dynamics of users’ emotions when watching videos. The recognition result it provides can be aligned with the video content and tell us which specific content in the video evokes which emotions. Most of the previous works on fine-grained emotion recognition require fine-grained emotion labels to train the recognition algorithm. However, the experiments to collect these fine-grained emotion labels are usually costly and time-consuming. Thus, this thesis focuses on investigating whether we can accurately predict the emotions of users at a fine granularity level with only a limited amount of emotion ground truth labels for training. We start our technical contribution in Chapter 3 by building up the baseline methods which are trained using fine-grained emotion labels. This can help us understand how accurate the recognition can be if we take advantage of the fine-grained emotion labels. We propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (fine-grained segment of signals) using physiological signals. CorrNet extracts features both inside each fine-grained signal segment (instance) and between different instances for the same video stimuli (correlation-based features). 
We found that, compared to sequential learning, correlation-based instance learning offers advantages of higher recognition accuracy, less overfitting and less computational complexity. Compared to collecting fine-grained emotion labels, it is easier to collect only one emotion label after the user watched that stimulus (i.e., the post-stimuli emotion labels). Therefore, in the second technical chapter (Chapter 4) of the thesis, we investigate whether the emotions can be recognized at a fine granularity level by training with only post-stimuli emotion labels (i.e., labels users annotated after watching videos), and propose an Emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL). EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of fully-supervised training, the instances are weakly-supervised by the post-stimuli labels in the training stage. Our experiments show that weakly-supervised learning can reduce overfitting caused by the temporal mismatch between fine-grained annotations and input signals. Although the weakly-supervised learning algorithm developed in Chapter 4 can obtain accurate recognition results with only a few annotations, it can only distinguish the annotated (post-stimuli) emotion from the baseline emotion (e.g., neutral) because only post-stimuli labels are used for training. The non-annotated emotions are all categorized as part of the baseline. To overcome this, in Chapter 5, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. According to the experiments we ran in this chapter, EmoDSN achieves promising results by using only 5 shots (5 samples in each emotion category) of training data.
Reflecting on the achievements reported in this thesis, we conclude that the fully-supervised algorithm (Chapter 3) can result in more accurate fine-grained emotion recognition results if the annotation quantity is sufficient. The weakly-supervised learning method (Chapter 4) can result in better recognition results at the instance level compared to fully-supervised methods. We also found that the weakly-supervised learning methods perform best if users annotate their most salient but short emotions, or their overall and longer-duration (i.e., persisting) emotions. The few-shot learning method (Chapter 5) can recognize more emotion categories than the weakly-supervised learning method while using fewer training samples than the fully-supervised learning method. However, its limitation is that accurate recognition results can only be achieved by a subject-dependent model. ...
Journal article (2022) - Tianyi Zhang, Abdallah El Ali, Alan Hanjalic, Pablo Cesar
Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model. However, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN) which can rapidly converge on a small amount of training data, typically fewer than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three different datasets collected in three different environments: desktop, mobile and HMD-based virtual reality, respectively. The results from our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We get an average accuracy of 76.04%, 76.62% and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, by using only 5 shots of training data. Our experiments show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching. ...
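The idea of pushing apart segments with different labels can be sketched with the generic contrastive loss commonly used for Siamese networks. This is an assumption made for illustration: the abstract does not state which loss EmoDSN uses, and the function names, the margin value, and the toy embeddings below are hypothetical.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a: Sequence[float], emb_b: Sequence[float],
                     same_label: bool, margin: float = 1.0) -> float:
    """Generic contrastive loss for one pair of embeddings.

    Pairs with the same label are penalized by their squared distance.
    Pairs with different labels are penalized only while they are closer
    than `margin`, which pushes them apart during training.
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings of three signal segments (placeholder values).
high_arousal_1 = [0.0, 0.0]
high_arousal_2 = [3.0, 4.0]
low_arousal = [0.5, 0.0]

loss_same = contrastive_loss(high_arousal_1, high_arousal_2, same_label=True)
loss_diff_close = contrastive_loss(high_arousal_1, low_arousal, same_label=False)
loss_diff_far = contrastive_loss(high_arousal_2, low_arousal, same_label=False)
```

Minimizing this loss over many pairs shrinks distances within a class and enlarges distances across classes up to the margin, which is one reason such models can work with few samples per class.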

A Continuous Physiological and Behavioral Emotion Annotation Dataset for 360 VR Videos

Journal article (2021) - Tong Xue, Abdallah El Ali, Tianyi Zhang, Gangyi Ding, Pablo Cesar
Watching 360 videos using Virtual Reality (VR) head-mounted displays (HMDs) provides interactive and immersive experiences, where videos can evoke different emotions. Existing emotion self-report techniques within VR, however, are either retrospective or interrupt the immersive experience. To address this, we introduce the Continuous Physiological and Behavioral Emotion Annotation Dataset for 360 Videos (CEAP-360VR). We conducted a controlled study (N=32) where participants used a Vive Pro Eye HMD to watch eight validated affective 360 video clips, and annotated their valence and arousal (V-A) continuously. We collected (a) behavioral (head and eye movements; pupillometry) signals, (b) physiological (heart rate, skin temperature, electrodermal activity) responses, (c) momentary emotion self-reports, (d) within-VR discrete emotion ratings, and (e) motion sickness, presence, and workload. We show the consistency of continuous annotation trajectories and verify their mean V-A annotations. We find high consistency between viewed 360 video regions across subjects, with higher consistency for eye than head movements. We furthermore run baseline classification experiments, where Random Forest classifiers with 2s segments show good accuracies for subject-independent models: 66.80% (V) and 64.26% (A) for binary classification; 49.92% (V) and 52.20% (A) for 3-class classification. Our open dataset allows further experiments with continuous emotion self-reports collected in 360 VR environments, which can enable automatic assessment of immersive Quality of Experience (QoE) and momentary affective states. ...

Real-time, Continuous Emotion Annotation for Collecting Precise Mobile Video Ground Truth Labels

Conference paper (2020) - Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, Pablo Cesar
Collecting accurate and precise emotion ground truth labels for mobile video watching is essential for ensuring meaningful predictions. However, video-based emotion annotation techniques either rely on post-stimulus discrete self-reports, or allow real-time, continuous emotion annotations (RCEA) only for desktop settings. Following a user-centric approach, we designed an RCEA technique for mobile video watching, and validated its usability and reliability in a controlled, indoor (N=12) and later outdoor (N=20) study. Drawing on physiological measures, interaction logs, and subjective workload reports, we show that (1) RCEA is perceived to be usable for annotating emotions during mobile video watching, without increasing users' mental workload, and (2) the resulting time-variant annotations are comparable with intended emotion attributes of the video stimuli (classification error for valence: 8.3%; arousal: 25%). We contribute a validated annotation technique and associated annotation fusion method that is suitable for collecting fine-grained emotion annotations while users watch mobile videos. ...

Fine-grained emotion recognition for video watching using wearable physiological sensors

Journal article (2020) - Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, Pablo Cesar
Recognizing user emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimulus, or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (fine-grained segment of signals) using only wearable, physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimuli (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE), and thereafter on an outdoor-mobile affect dataset (MERCA) which we collected using a smart wristband and wearable eyetracker. Results show that for subject-independent binary classification (high-low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show: (1) instance segment lengths between 1–4 s result in the highest recognition accuracies, (2) accuracies between laboratory-grade and wearable sensors are comparable, even under low sampling rates (≤64 Hz), and (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance. ...
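The notion of features computed between instances of the same stimulus can be illustrated with a small sketch. It assumes Pearson correlation as the similarity measure and summarizes each instance by its mean correlation with the other instances. Both choices, and the names `pearson` and `correlation_features`, are hypothetical and are not taken from the CorrNet paper.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_features(instances: Sequence[Sequence[float]]) -> List[float]:
    """For each instance, the mean Pearson correlation with all other instances."""
    features = []
    for i, inst in enumerate(instances):
        others = [pearson(inst, other)
                  for j, other in enumerate(instances) if j != i]
        features.append(sum(others) / len(others) if others else 0.0)
    return features


# Toy instances from the same (hypothetical) video stimulus.
instances = [
    [1.0, 2.0, 3.0, 4.0],   # rising
    [2.0, 4.0, 6.0, 8.0],   # rising, larger amplitude
    [4.0, 3.0, 2.0, 1.0],   # falling
]
features = correlation_features(instances)
```

In this toy case the falling instance is negatively correlated with both rising ones, so its feature value is the lowest of the three.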

Correlation-based feature extraction algorithm using skin conductance and pupil diameter for emotion recognition

Conference paper (2019) - Tianyi Zhang, Abdallah El Ali, Chen Wang, Xintong Zhu, Pablo Cesar
To recognize emotions using less obtrusive wearable sensors, we present a novel emotion recognition method that uses only pupil diameter (PD) and skin conductance (SC). Psychological studies show that these two signals are related to the attention level of humans exposed to visual stimuli. Based on this, we propose a feature extraction algorithm that extracts correlation-based features for participants watching the same video clip. To boost performance given limited data, we implement a learning system without a deep architecture to classify arousal and valence. Our method outperforms not only state-of-the-art approaches, but also widely-used traditional and deep learning methods. ...