Multimodal fusion of body movement signals for no-audio speech detection

Conference Paper (2020)
Author(s)

X. Wang (TU Delft - Multimedia Computing, Xi’an Jiaotong University)

Jihua Zhu (Xi’an Jiaotong University)

O.E. Scharenborg (TU Delft - Multimedia Computing)

Multimedia Computing
Publication Year
2020
Language
English
Volume number
2882
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

No-audio Multimodal Speech Detection is one of the tasks in MediaEval 2020, whose goal is to automatically detect whether someone is speaking during social interaction on the basis of body movement signals. In this paper, a multimodal fusion method, combining signals obtained from an overhead camera and a wearable accelerometer, is proposed to determine whether someone is speaking. The proposed system directly takes the accelerometer signals as input, while using a pre-trained 3D convolutional network to extract the video features used as input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.
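The abstract describes combining accelerometer signals with video embeddings from a pre-trained 3D CNN. As a rough illustration only (the paper's actual architecture, feature dimensions, and classifier are not specified here, so the dimensions and the logistic layer below are assumptions), feature-level fusion of the two modalities can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions, not taken from the paper
ACC_DIM = 60    # e.g. a flattened tri-axial accelerometer window
VID_DIM = 512   # e.g. an embedding from a pre-trained 3D CNN

def fuse_features(acc_feat: np.ndarray, vid_feat: np.ndarray) -> np.ndarray:
    """Feature-level fusion: concatenate the two modality vectors."""
    return np.concatenate([acc_feat, vid_feat], axis=-1)

def predict_speaking(fused: np.ndarray, w: np.ndarray, b: float) -> float:
    """Binary speaking/not-speaking decision via a logistic layer (illustrative)."""
    logit = float(fused @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probability of "speaking"

acc = rng.standard_normal(ACC_DIM)
vid = rng.standard_normal(VID_DIM)
fused = fuse_features(acc, vid)          # shape: (ACC_DIM + VID_DIM,)
w = rng.standard_normal(ACC_DIM + VID_DIM) * 0.01
p = predict_speaking(fused, w, 0.0)
print(fused.shape, 0.0 <= p <= 1.0)
```

In practice the classifier weights would be learned on the task's training data; the sketch only shows how the two signal streams can be merged into a single decision.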

Files

Paper8.pdf
(pdf | 0.871 MB)
License info not available