Multimodal fusion of body movement signals for no-audio speech detection
X. Wang (TU Delft - Multimedia Computing, Xi’an Jiaotong University)
Jihua Zhu (Xi’an Jiaotong University)
O.E. Scharenborg (TU Delft - Multimedia Computing)
Abstract
No-audio Multimodal Speech Detection is one of the tasks in MediaEval 2020, with the goal of automatically detecting whether someone is speaking in a social interaction on the basis of body movement signals. In this paper, a multimodal fusion method that combines signals obtained from an overhead camera and a wearable accelerometer is proposed to determine whether someone is speaking. The proposed system takes the raw accelerometer signals directly as input, while a pre-trained 3D convolutional network is used to extract the video features that serve as the second input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions from previous years.
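To make the described architecture concrete, the following is a minimal sketch of a two-stream model: raw accelerometer signals fed directly into a small 1D-convolutional branch, video clips passed through a pre-trained 3D CNN, and the two feature vectors fused for binary speaking/not-speaking classification. The specific backbone (torchvision's r3d_18), the layer sizes, and late fusion by concatenation are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class FusionSpeechDetector(nn.Module):
    """Two-stream speaking/not-speaking classifier (illustrative sketch).

    Assumptions (not from the paper): r3d_18 as the pre-trained 3D CNN,
    a 1D-conv branch for the tri-axial accelerometer, and late fusion
    by feature concatenation.
    """

    def __init__(self):
        super().__init__()
        # Video branch: pre-trained 3D CNN with its classifier head removed.
        backbone = r3d_18(weights="DEFAULT")  # Kinetics-400 pre-training
        backbone.fc = nn.Identity()           # keep the 512-d feature vector
        self.video_branch = backbone

        # Accelerometer branch: raw tri-axial signal -> 64-d embedding.
        self.accel_branch = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(32, 64),
            nn.ReLU(),
        )

        # Fusion head: concatenated features -> speaking logit.
        self.classifier = nn.Linear(512 + 64, 1)

    def forward(self, video, accel):
        # video: (batch, 3, frames, height, width); accel: (batch, 3, samples)
        v = self.video_branch(video)
        a = self.accel_branch(accel)
        return self.classifier(torch.cat([v, a], dim=1))

# Example forward pass on dummy data (shapes are illustrative).
model = FusionSpeechDetector()
video = torch.randn(2, 3, 16, 112, 112)  # two 16-frame clips
accel = torch.randn(2, 3, 200)           # two 200-sample accelerometer windows
logits = model(video, accel)             # shape: (2, 1)
```

In this sketch the accelerometer stream is deliberately lightweight, mirroring the paper's point that the raw wearable signal is used directly while the heavier feature extraction is delegated to the pre-trained video backbone.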