Scene-Speaker Emotion Aware Network

Dual Network Strategy for Conversational Emotion Recognition

Journal Article (2025)
Author(s)

Bingni Li (Xidian University)

Yu Gu (Xidian University)

Chenyu Li (Xidian University)

He Zhang (Northwest University, China)

Linsong Liu (Xidian University)

Hai-Xiang Lin (TU Delft - Mathematical Physics)

Shuang Wang (Xidian University)

Research Group
Mathematical Physics
DOI related publication
https://doi.org/10.3390/electronics14132660
Publication Year
2025
Language
English
Issue number
13
Volume number
14
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Incorporating external knowledge has been shown to improve emotion understanding in dialogues by enriching contextual information such as character motivations, psychological states, and causal relations between events. Filtering and categorizing this information can significantly enhance model performance. In this paper, we present an innovative Emotion Recognition in Conversation (ERC) framework, the Scene-Speaker Emotion Awareness Network (SSEAN), which employs a dual-strategy modeling approach. SSEAN uniquely incorporates external commonsense knowledge describing speaker states into multimodal inputs. Using parallel recurrent networks to separately capture scene-level and speaker-level emotions, the model reduces the accumulation of redundant information in the speaker's emotional space. In addition, we introduce an attention-based dynamic screening module that improves the quality of the integrated external commonsense knowledge at three levels: (1) speaker-listener-aware input structuring, (2) role-based segmentation, and (3) context-guided attention refinement. Experiments show that SSEAN outperforms existing state-of-the-art models on two widely used benchmark datasets in both text-only and multimodal settings.
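
The abstract describes the architecture only at a high level. The following minimal PyTorch-style sketch is not the authors' released code; it merely illustrates the dual-track idea (a scene-level recurrent track alongside per-speaker tracks, with attention-screened commonsense input) under the assumption that utterance and commonsense features are precomputed vectors. All class, method, and parameter names here (DualTrackERC, knowledge_attn, d_hid, and so on) are hypothetical.

# Illustrative sketch only, not the SSEAN implementation from the paper.
import torch
import torch.nn as nn


class DualTrackERC(nn.Module):
    """Toy dual-track ERC model: a scene-level GRU over the whole dialogue and
    a speaker-level GRU run over each speaker's own turns, with commonsense
    vectors filtered by context-guided attention before entering the speaker track."""

    def __init__(self, d_utt: int, d_know: int, d_hid: int, n_classes: int):
        super().__init__()
        self.utt_proj = nn.Linear(d_utt, d_hid)
        self.know_proj = nn.Linear(d_know, d_hid)
        # Attention that lets the dialogue context screen the commonsense vectors.
        self.knowledge_attn = nn.MultiheadAttention(d_hid, num_heads=4, batch_first=True)
        # Parallel recurrent tracks: one for scene context, one for speakers.
        self.scene_gru = nn.GRU(d_hid, d_hid, batch_first=True)
        self.speaker_gru = nn.GRU(2 * d_hid, d_hid, batch_first=True)
        self.classifier = nn.Linear(2 * d_hid, n_classes)

    def forward(self, utt_feats, know_feats, speaker_ids):
        # utt_feats:   (B, T, d_utt)  precomputed multimodal utterance features
        # know_feats:  (B, T, d_know) commonsense (speaker-state) vectors
        # speaker_ids: (B, T)         integer id of the speaker of each turn
        u = self.utt_proj(utt_feats)                  # (B, T, d_hid)
        k = self.know_proj(know_feats)                # (B, T, d_hid)
        # Context-guided screening: utterances attend over the knowledge vectors.
        k_filtered, _ = self.knowledge_attn(u, k, k)  # (B, T, d_hid)

        scene_h, _ = self.scene_gru(u)                # scene-level track

        # Speaker-level track: run the GRU over each speaker's own turns so that
        # other speakers' turns do not accumulate in that speaker's emotional state.
        spk_in = torch.cat([u, k_filtered], dim=-1)   # (B, T, 2*d_hid)
        speaker_h = torch.zeros_like(scene_h)
        for b in range(utt_feats.size(0)):
            for s in speaker_ids[b].unique():
                idx = (speaker_ids[b] == s).nonzero(as_tuple=True)[0]
                out, _ = self.speaker_gru(spk_in[b, idx].unsqueeze(0))
                speaker_h[b, idx] = out.squeeze(0)

        # Per-utterance emotion logits from the fused scene and speaker states.
        return self.classifier(torch.cat([scene_h, speaker_h], dim=-1))


# Example usage with random tensors (2 dialogues, 6 turns, 7 emotion classes).
model = DualTrackERC(d_utt=768, d_know=768, d_hid=128, n_classes=7)
logits = model(torch.randn(2, 6, 768), torch.randn(2, 6, 768),
               torch.randint(0, 2, (2, 6)))
print(logits.shape)  # torch.Size([2, 6, 7])

The separation of the two GRU tracks in this sketch mirrors the paper's stated motivation: keeping scene-level context and per-speaker emotional state in parallel recurrent spaces, so redundant dialogue-wide information does not pile up in any single speaker's representation.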