Graph Convolution-Based Decoupling and Consistency-Driven Fusion for Multimodal Emotion Recognition
Yingmin Deng (Xidian University)
Chenyu Li (Xidian University)
Yu Gu (Xidian University)
He Zhang (Northwest University)
Linsong Liu (Xidian University)
Haixiang Lin (TU Delft - Mathematical Physics)
Shuang Wang (Xidian University)
Hanlin Mo (Xidian University)
Abstract
Multimodal emotion recognition (MER) is essential for understanding human emotions from diverse sources such as speech, text, and video. However, modality heterogeneity and inconsistent expression pose challenges for effective feature fusion. To address this, we propose a novel MER framework combining a Dynamic Weighted Graph Convolutional Network (DW-GCN) for feature disentanglement and a Cross-Attention Consistency-Gated Fusion (CACG-Fusion) module for robust integration. DW-GCN models complex inter-modal relationships, enabling the extraction of both common and private features. The CACG-Fusion module subsequently enhances classification performance through dynamic alignment of cross-modal cues, employing attention-based coordination and consistency-preserving gating mechanisms to optimize feature integration. Experiments on the CMU-MOSI and CMU-MOSEI datasets demonstrate that our method achieves state-of-the-art performance, significantly improving the Acc7, Acc2, and F1 scores.
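The abstract describes the CACG-Fusion module only at a high level (cross-attention coordination followed by a consistency-preserving gate). The paper's actual architecture is not reproduced here; the following is a minimal numpy sketch of that general idea, under our own assumptions: two modalities with equal sequence length and feature dimension, scaled dot-product cross-attention in both directions, and a per-token cosine-similarity gate standing in for the consistency mechanism. The function and variable names are ours, not the authors'.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats):
    # One modality queries another: softmax(Q K^T / sqrt(d)) V.
    d = q_feats.shape[1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_feats

def consistency_gated_fusion(text, audio):
    """Illustrative fusion of two (n_tokens, d) feature matrices.

    Assumption (ours): the two attended views are averaged, and a
    per-token cosine-similarity gate in [0, 1] decides how much of
    the fused view versus the original text features to keep.
    """
    t2a = cross_attention(text, audio)   # text attends to audio
    a2t = cross_attention(audio, text)   # audio attends to text
    # Consistency gate: cosine similarity of the two views, mapped to [0, 1].
    num = (t2a * a2t).sum(axis=1)
    den = np.linalg.norm(t2a, axis=1) * np.linalg.norm(a2t, axis=1) + 1e-8
    gate = (num / den + 1.0) / 2.0
    fused = (t2a + a2t) / 2.0
    return gate[:, None] * fused + (1.0 - gate[:, None]) * text
```

When the two modalities agree (high cosine similarity between the attended views), the gate passes the fused representation through; when they conflict, it falls back toward the unimodal features, which is one plausible reading of "consistency-preserving gating".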