Few-shot emotion recognition using intelligent voice assistants and wearables

Learning from few samples of speech and physiological signals



Emotion recognition is one of the most widely studied areas of affective computing, and attempts have been made to design emotion recognition systems for everyday settings. The ubiquitous presence of intelligent voice assistants (IVAs) in households makes them a natural anchor for introducing emotion recognition technology to consumers. Existing systems, however, lack such pipelines and rely on dictionary-based architectures in their design. Furthermore, these systems lack conversational properties and are merely extensions of information retrieval engines.

In this setting, we propose and develop emotion recognition pipelines suited to the interactions common with these IVAs. To augment existing emotion recognition pipelines that rely on audio information, we incorporate physiological information derived from wearables. Our proposed model uses multimodal embeddings with a Siamese network to recognize emotions from only a few samples. Physiological signals of blood volume pulse (BVP) and electrodermal activity (EDA) serve as additional input embeddings alongside two audio embeddings derived from the speech samples. We employ state-of-the-art training schedules for Siamese networks, which require only a very limited amount of training on support datasets via sample-pair comparisons. The performance of the model is evaluated using weighted binary accuracy and F1 scores.
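The abstract does not give implementation details, so the following is only a minimal sketch of the pair-comparison idea behind a Siamese network with fused multimodal embeddings. All names (`fuse`, `shared_encoder`, `pair_distance`), the embedding dimensions, and the random linear projection standing in for the trained twin branch are illustrative assumptions, not the thesis's actual architecture:

```python
import math
import random

random.seed(0)

EMBED_DIM = 8  # illustrative; real learned embeddings are much larger


def fuse(audio_a, audio_b, bvp, eda):
    """Concatenate the two audio embeddings with the BVP and EDA
    embeddings into one multimodal feature vector (scheme assumed)."""
    return audio_a + audio_b + bvp + eda


def shared_encoder(x, weights):
    """Shared 'twin' branch: the SAME weights encode both inputs of a
    pair. A single linear projection stands in for the trained network."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


def pair_distance(x1, x2, weights):
    """Euclidean distance between the two encoded inputs of a pair."""
    e1, e2 = shared_encoder(x1, weights), shared_encoder(x2, weights)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))


def random_vec(n):
    return [random.gauss(0, 1) for _ in range(n)]


# Toy inputs: 4 modalities x 4 dims each -> 16-dim fused vector.
weights = [random_vec(16) for _ in range(EMBED_DIM)]
query = fuse(random_vec(4), random_vec(4), random_vec(4), random_vec(4))
support = fuse(random_vec(4), random_vec(4), random_vec(4), random_vec(4))

# Few-shot decision: compare the query against each labelled support
# sample and assign the label of the nearest one (nearest-neighbour
# rule assumed; the actual decision rule depends on the training loss).
d = pair_distance(query, support, weights)
print(f"pair distance: {d:.3f}")
```

In this scheme, training adjusts the shared encoder so that same-emotion pairs land close together and different-emotion pairs land far apart, which is what allows classification from only a handful of labelled support samples.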

The proposed model is applied to two datasets representing two distinct experimental settings: the K-EmoCon dataset and the RECOLA dataset. We demonstrate an improvement over the state-of-the-art accuracy on the K-EmoCon dataset, with accuracies of 63.97% and 66.91% on the arousal and valence dimensions respectively. On the RECOLA dataset, the model performs moderately well, with 53.81% and 53.87% for the arousal and valence dimensions respectively. In addition, we present a study of the effect of varying the size of the support set available for training. We make several salient observations across individual participants and identify how the label distributions affect the performance of the model. Finally, we investigate the impact of real-world noise samples from the DEMAND dataset on the two datasets. We observe that the proposed model is robust and maintains its performance even in the presence of imputed noise.
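The abstract does not specify how the DEMAND noise was imputed into the speech; a common approach is to scale a noise recording so that the mixture reaches a target signal-to-noise ratio and add it sample-wise. The sketch below illustrates that mixing scheme; the function name, the 10 dB SNR, and the synthetic signals are assumptions for illustration only:

```python
import math
import random

random.seed(1)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB)
    relative to `speech`, then add it sample-wise."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Target noise power for the requested SNR: P_s / P_n = 10^(SNR/10).
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    scale = math.sqrt(target_p_noise / p_noise)
    return [s + scale * n for s, n in zip(speech, noise)]


# Stand-ins for one second of 16 kHz audio: a 440 Hz tone as "speech"
# and Gaussian samples as "noise" (a DEMAND clip in the real setup).
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 0.3) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

Evaluating the trained model on such noisy mixtures, rather than on the clean recordings, is what the robustness experiments measure.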