Multimodal Cross-context Recognition of Negative Interactions
Abstract
Negative emotions and stress can impact human-human interactions and eventually lead to aggression. From the perspective of surveillance systems, it is highly important to recognize, as early as possible, when an interaction escalates and human intervention is needed. One limitation of deploying such a system in real life is that in practice it can only be trained on a limited number of situations. In this paper we examined the generalization capabilities of a trained system under context change. For this purpose we developed scenarios and made audio-visual recordings in four different contexts in which negative interactions might occur. To quantify cross-context performance, we kept the test context fixed and trained either on the same context (via cross-validation) or on each of the other contexts. To explore whether multiple examples in the training set are beneficial, we also trained the classifier on a merged corpus of the three contexts that were not used for testing. These experiments were performed with audio features, video features, and audio-visual feature-level fusion to investigate which modality generalizes best. We found that context change causes a decrease in performance that varies with the similarity between contexts. Merging multiple contexts for training in most cases yields performance just below that of the best-predicting single context. Audio is the most robust modality, and in most cases the performance of audio-visual fusion is very close to that of audio.
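The sketch below illustrates the cross-context evaluation protocol described in the abstract: for a fixed test context, performance is measured with within-context cross-validation, with training on each other context separately, and with training on the merged corpus of the three remaining contexts, for audio, video, and audio-visual feature-level fusion. The synthetic feature matrices, the context names, the SVM classifier, and fusion by simple feature concatenation are assumptions for illustration only, not the paper's actual data or models.

```python
# Hypothetical sketch of the cross-context evaluation protocol.
# Data, classifier, and fusion scheme are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
contexts = ["context_A", "context_B", "context_C", "context_D"]

# Placeholder corpora: per context, random audio/video features and binary
# labels (negative interaction vs. neutral).
data = {
    c: {
        "audio": rng.normal(size=(200, 20)),
        "video": rng.normal(size=(200, 30)),
        "labels": rng.integers(0, 2, size=200),
    }
    for c in contexts
}

def features(corpus, modality):
    """Return one modality's feature matrix, or the audio-visual
    feature-level fusion (concatenation) for 'fusion'."""
    if modality == "fusion":
        return np.hstack([corpus["audio"], corpus["video"]])
    return corpus[modality]

for modality in ["audio", "video", "fusion"]:
    for test_ctx in contexts:
        X_test = features(data[test_ctx], modality)
        y_test = data[test_ctx]["labels"]

        # Within-context baseline: cross-validation on the test context itself.
        within = cross_val_score(SVC(), X_test, y_test, cv=5).mean()

        # Cross-context: train on each of the other contexts separately.
        cross = {}
        for train_ctx in contexts:
            if train_ctx == test_ctx:
                continue
            X_train = features(data[train_ctx], modality)
            y_train = data[train_ctx]["labels"]
            clf = SVC().fit(X_train, y_train)
            cross[train_ctx] = accuracy_score(y_test, clf.predict(X_test))

        # Merged training: pool the three contexts not used for testing.
        X_merged = np.vstack([features(data[c], modality)
                              for c in contexts if c != test_ctx])
        y_merged = np.concatenate([data[c]["labels"]
                                   for c in contexts if c != test_ctx])
        merged = accuracy_score(
            y_test, SVC().fit(X_merged, y_merged).predict(X_test))

        print(modality, test_ctx, round(within, 2), cross, round(merged, 2))
```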