Multimodal Cross-context Recognition of Negative Interactions
Abstract
Negative emotions and stress can impact human-human interactions and eventually lead to aggression. From the perspective of surveillance systems, it is highly important to recognize, as early as possible, when an interaction escalates and human intervention is needed. One limitation of deploying such a system in real life is that in practice it can only be trained on a limited number of situations. In this paper we examined the generalization capabilities of a trained system under context change. For this purpose we developed scenarios and made audio-visual recordings in four different contexts in which negative interactions might occur. To quantify cross-context performance, we kept the test context fixed and trained either on the same context (via cross-validation) or on each of the other contexts. To explore whether multiple examples in the training set are beneficial, we also trained the classifier on a merged corpus of the three contexts that were not used for testing. These experiments were performed with audio features, video features, and audio-visual feature-level fusion to investigate which modality generalizes best. We found that context change causes a decrease in performance that varies with the similarity between contexts. Merging multiple contexts for training in most cases yields performance just below that of the best-predicting single context. Audio is the most robust modality, and in most cases the performance of audio-visual fusion is very close to that of audio.
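The sketch below illustrates the cross-context evaluation protocol described in the abstract: for a fixed test context, performance is measured with within-context cross-validation, with training on each other context separately, and with training on the merged corpus of the three remaining contexts, for audio, video, and audio-visual feature-level fusion. The synthetic feature matrices, the context names, the SVM classifier, and fusion by simple feature concatenation are assumptions for illustration only, not the paper's actual data or models.

```python
# Hypothetical sketch of the cross-context evaluation protocol.
# Data, classifier, and fusion scheme are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
contexts = ["context_A", "context_B", "context_C", "context_D"]

# Placeholder corpora: per context, random audio/video features and binary
# labels (negative interaction vs. neutral).
data = {
    c: {
        "audio": rng.normal(size=(200, 20)),
        "video": rng.normal(size=(200, 30)),
        "labels": rng.integers(0, 2, size=200),
    }
    for c in contexts
}

def features(corpus, modality):
    """Return one modality's feature matrix, or the audio-visual
    feature-level fusion (concatenation) for 'fusion'."""
    if modality == "fusion":
        return np.hstack([corpus["audio"], corpus["video"]])
    return corpus[modality]

for modality in ["audio", "video", "fusion"]:
    for test_ctx in contexts:
        X_test = features(data[test_ctx], modality)
        y_test = data[test_ctx]["labels"]

        # Within-context baseline: cross-validation on the test context itself.
        within = cross_val_score(SVC(), X_test, y_test, cv=5).mean()

        # Cross-context: train on each of the other contexts separately.
        cross = {}
        for train_ctx in contexts:
            if train_ctx == test_ctx:
                continue
            X_train = features(data[train_ctx], modality)
            y_train = data[train_ctx]["labels"]
            clf = SVC().fit(X_train, y_train)
            cross[train_ctx] = accuracy_score(y_test, clf.predict(X_test))

        # Merged training: pool the three contexts not used for testing.
        X_merged = np.vstack([features(data[c], modality)
                              for c in contexts if c != test_ctx])
        y_merged = np.concatenate([data[c]["labels"]
                                   for c in contexts if c != test_ctx])
        merged = accuracy_score(
            y_test, SVC().fit(X_merged, y_merged).predict(X_test))

        print(modality, test_ctx, round(within, 2), cross, round(merged, 2))
```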