Multimodal transformer for depression detection based on EEG and interview data

Journal Article (2026)
Author(s)

Nima Esmi (University Medical Center Groningen, Khazar University)

Asadollah Shahbahrami (University of Guilan, Khazar University)

G. Gaydadjiev (TU Delft - Computer Engineering)

Peter de Jonge (University Medical Center Groningen)

Research Group
Computer Engineering
DOI
https://doi.org/10.1016/j.bspc.2025.109039
Publication Year
2026
Language
English
Volume number
113
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Depression detection benefits from combining neurological and behavioral indicators, yet integrating heterogeneous modalities such as EEG and interview audio remains challenging. We propose a transformer-based multimodal framework that jointly models spectral, spatial, and temporal EEG features alongside linguistic and paralinguistic cues from interviews. By employing synchronized multi-head cross-attention and self-attention mechanisms, the model effectively captures intra- and inter-modal correlations. In addition, a flexible temporal sequence matching strategy reduces EEG channel requirements, enhancing device portability. Evaluated on the MODMA and DAIC-WOZ datasets, our approach achieves superior performance compared to state-of-the-art models, with a 4.7% improvement in accuracy and a 10% increase in precision. These results demonstrate the potential of the proposed framework for accurate, scalable, and cost-effective depression detection in both clinical and real-world settings.
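The paper's exact architecture is not reproduced here, but the core idea in the abstract (self-attention to capture intra-modal structure, multi-head cross-attention to capture inter-modal correlations between EEG and interview features) can be illustrated with a minimal PyTorch sketch. All module names, dimensions, and layer choices below are illustrative assumptions, not the authors' implementation; both modalities are assumed to have already been projected into a shared embedding space.

```python
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    """Hypothetical sketch of one fusion block: each modality first attends
    to itself (intra-modal self-attention), then attends to the other
    modality (inter-modal cross-attention). Dimensions are illustrative."""

    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn_eeg = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn_int = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_eeg = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn_int = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_eeg = nn.LayerNorm(d_model)
        self.norm_int = nn.LayerNorm(d_model)

    def forward(self, eeg: torch.Tensor, interview: torch.Tensor):
        # Intra-modal: each stream refines its own temporal representation.
        eeg = self.norm_eeg(eeg + self.self_attn_eeg(eeg, eeg, eeg)[0])
        interview = self.norm_int(
            interview + self.self_attn_int(interview, interview, interview)[0]
        )
        # Inter-modal: EEG queries attend to interview keys/values,
        # and vice versa; sequence lengths need not match.
        eeg_fused = eeg + self.cross_attn_eeg(eeg, interview, interview)[0]
        int_fused = interview + self.cross_attn_int(interview, eeg, eeg)[0]
        return eeg_fused, int_fused

# Toy usage: batch of 2 subjects with unequal sequence lengths per modality.
eeg = torch.randn(2, 200, 128)        # e.g. 200 EEG time steps
interview = torch.randn(2, 80, 128)   # e.g. 80 interview frames
eeg_fused, int_fused = CrossModalFusionBlock()(eeg, interview)
print(eeg_fused.shape, int_fused.shape)
# torch.Size([2, 200, 128]) torch.Size([2, 80, 128])
```

Because cross-attention handles query and key/value sequences of different lengths, a block like this can tolerate mismatched EEG and interview sampling rates, which is one way the flexible temporal sequence matching described in the abstract could be realized.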