Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
Yingmin Deng (Xidian University)
Chenyu Li (Xidian University)
Yu Gu (Xidian University)
He Zhang (Northwest University)
Linsong Liu (Xidian University)
Haixiang Lin (TU Delft - Mathematical Physics)
Shuang Wang (Xidian University)
Hanlin Mo (Xidian University)
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the Acc7, Acc2, and F1 scores.
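The abstract describes the CACG-Fusion module only at a high level (cross-attention coordination followed by a consistency-preserving gate). The paper's actual architecture is not reproduced here; the following is a minimal numpy sketch of that general idea, under our own assumptions: two modalities with equal sequence length and feature dimension, scaled dot-product cross-attention in both directions, and a per-token cosine-similarity gate standing in for the consistency mechanism. The function and variable names are ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # One modality queries another: softmax(Q K^T / sqrt(d)) V.
    d = q_feats.shape[1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

def consistency_gated_fusion(text, audio):
    """Illustrative fusion of two (n_tokens, d) feature matrices.

    Assumption (ours): the two attended views are averaged, and a
    per-token cosine-similarity gate in [0, 1] decides how much of
    the fused view versus the original text features to keep.
    """
    t2a = cross_attention(text, audio)   # text attends to audio
    a2t = cross_attention(audio, text)   # audio attends to text
    # Consistency gate: cosine similarity of the two views, mapped to [0, 1].
    num = (t2a * a2t).sum(axis=1)
    den = np.linalg.norm(t2a, axis=1) * np.linalg.norm(a2t, axis=1) + 1e-8
    gate = (num / den + 1.0) / 2.0
    fused = (t2a + a2t) / 2.0
    return gate[:, None] * fused + (1.0 - gate[:, None]) * text
```

When the two modalities agree (high cosine similarity between the attended views), the gate passes the fused representation through; when they conflict, it falls back toward the unimodal features, which is one plausible reading of "consistency-preserving gating".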