SpeechCAT: Cross-Attentive Transformer for Audio to Motion Generation

Master Thesis (2025)
Author(s)

S. Deaconu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Xucong Zhang – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

J.C. van Gemert – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

H. Wang – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
19-02-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Audio-to-motion generation is an important task with applications in virtual avatar creation for XR systems and intelligent robot control in everyday scenarios.
Most current motion generation methods rely on a single encoder-decoder architecture to model all body parts jointly, which limits their capacity to capture the diverse and complex motions exhibited by humans.
In this thesis, we propose a novel method, SpeechCAT, that employs three separate encoder-decoder modules to model the motions of the face, body, and hands individually. To capture the relationships and synchronization among these body parts, we introduce a cross-attention mechanism that learns their correlations.
SpeechCAT ensures sufficient capacity to model the unique characteristics of each body part while preserving the coherence between them.
Our experimental results show that SpeechCAT outperforms baseline methods, generating diverse, realistic, and synchronized motions across the face, body, and hands.
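
The cross-attention idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the abstract does not specify layer types, dimensions, or how the attended features are combined. The function names `cross_attention` and `fuse_parts`, the toy feature sizes, the omission of learned projections, and the residual addition are all assumptions made for illustration. The sketch only shows how each body-part stream could attend over the features of the other two streams using scaled dot-product attention.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Scaled dot-product attention without learned projections (assumption).

    queries: list of frame vectors from one body part (each of dimension d)
    context: list of frame vectors from the other body parts (dimension d)
    Returns one attended vector per query frame.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query frame to every context frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        # Weighted average of the context frames.
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out


def fuse_parts(features):
    """Let each part attend to the frames of the other parts.

    features: dict mapping a part name to its list of frame vectors.
    The residual addition below is an illustrative choice, not taken
    from the source.
    """
    fused = {}
    for part, own in features.items():
        others = [frame
                  for name, frames in features.items() if name != part
                  for frame in frames]
        attended = cross_attention(own, others)
        fused[part] = [[a + b for a, b in zip(x, y)]
                       for x, y in zip(own, attended)]
    return fused


# Toy per-part features standing in for the outputs of three separate
# encoders (hypothetical sizes: 4 frames, 8 dimensions).
random.seed(0)
T, D = 4, 8
features = {
    part: [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T)]
    for part in ("face", "body", "hands")
}
fused = fuse_parts(features)
```

In this sketch each part keeps its own feature stream, which mirrors the idea of separate per-part capacity, while the attention step gives every stream access to the other two. A practical model would typically use learned query, key, and value projections and multiple heads from a deep learning framework.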

Files

License info not available