Incorporating external knowledge has been shown to improve emotion understanding in dialogues by enriching contextual information such as character motivations, psychological states, and causal relations between events; filtering and categorizing this information can significantly enhance model performance. In this paper, we present an Emotion Recognition in Conversation (ERC) framework, the Scene-Speaker Emotion Awareness Network (SSEAN), which employs a dual-strategy modeling approach. SSEAN incorporates external commonsense knowledge describing speaker states into multimodal inputs and uses parallel recurrent networks to capture scene-level and speaker-level emotions separately, reducing the accumulation of redundant information in the speaker's emotional space. In addition, we introduce an attention-based dynamic screening module that improves the quality of the integrated commonsense knowledge at three levels: (1) speaker-listener-aware input structuring, (2) role-based segmentation, and (3) context-guided attention refinement. Experiments show that SSEAN outperforms existing state-of-the-art models on two widely used benchmark datasets in both text-only and multimodal settings.
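To make the dual-track idea concrete, the sketch below shows one way parallel recurrent tracks and attention-based knowledge screening could be wired together. It is a minimal illustration only: the abstract does not specify SSEAN's internals, so the use of GRUs, the single-head attention screening, all dimensions, and the class name DualTrackERC are our assumptions (role-based segmentation of speaker states, among other details, is omitted).

import torch
import torch.nn as nn


class DualTrackERC(nn.Module):
    """Toy dual-track ERC model: one recurrent track for scene-level
    context, one for speaker-level context enriched with commonsense
    knowledge that has been screened by context-guided attention."""

    def __init__(self, d_in: int, d_know: int, d_hid: int, n_classes: int):
        super().__init__()
        self.scene_rnn = nn.GRU(d_in, d_hid, batch_first=True)            # scene-level track
        self.speaker_rnn = nn.GRU(d_in + d_hid, d_hid, batch_first=True)  # speaker-level track
        self.know_proj = nn.Linear(d_know, d_hid)
        self.attn = nn.MultiheadAttention(d_hid, num_heads=1, batch_first=True)
        self.classifier = nn.Linear(2 * d_hid, n_classes)

    def forward(self, utt: torch.Tensor, know: torch.Tensor) -> torch.Tensor:
        # utt:  (B, T, d_in)        per-utterance (multimodal) features
        # know: (B, T, K, d_know)   K commonsense candidates per utterance
        B, T, K, _ = know.shape
        scene_h, _ = self.scene_rnn(utt)                     # (B, T, d_hid)
        # Context-guided screening: each scene state queries its K
        # knowledge candidates, keeping what the dialogue context supports.
        q = scene_h.reshape(B * T, 1, -1)
        kv = self.know_proj(know).reshape(B * T, K, -1)
        screened, _ = self.attn(q, kv, kv)                   # (B*T, 1, d_hid)
        screened = screened.reshape(B, T, -1)
        # Speaker-level track consumes utterances fused with screened knowledge.
        speaker_h, _ = self.speaker_rnn(torch.cat([utt, screened], dim=-1))
        # Per-utterance emotion logits from both tracks.
        return self.classifier(torch.cat([scene_h, speaker_h], dim=-1))


# Example: batch of 2 dialogues, 10 utterances each, 5 knowledge candidates,
# 7 emotion classes (hypothetical sizes).
model = DualTrackERC(d_in=100, d_know=64, d_hid=128, n_classes=7)
logits = model(torch.randn(2, 10, 100), torch.randn(2, 10, 5, 64))  # (2, 10, 7)

Keeping the two tracks as separate recurrences, rather than feeding everything through one encoder, is what lets the speaker track stay compact: scene context is summarized once and only the attention-screened knowledge, not every raw candidate, enters the speaker state.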