Audio-Visual Wake Word Spotting in MISP2021 Challenge

Dataset Release and Deep Analysis

Journal Article (2022)
Author(s)

Hengshun Zhou (University of Science and Technology of China)

Jun Du (University of Science and Technology of China)

Gongzhen Zou (University of Science and Technology of China)

Zhaoxu Nian (University of Science and Technology of China)

Chin-Hui Lee (Georgia Institute of Technology)

Sabato Marco Siniscalchi (Georgia Institute of Technology, University of Enna Kore)

Shinji Watanabe (Carnegie Mellon University)

Odette Scharenborg (TU Delft - Multimedia Computing)

Jingdong Chen (Northwestern Polytechnical University)

More Authors (External organisation)

Research Group
Multimedia Computing
DOI
https://doi.org/10.21437/Interspeech.2022-10650
Publication Year
2022
Language
English
Volume number
2022-September
Pages (from-to)
1111-1115
Event
23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 (2022-09-18 - 2022-09-22), Incheon, Republic of Korea
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database from the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, creating a shared, publicly available database for WWS. The released database and code will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigate different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses was conducted to observe how visual information assists acoustic information under different audio and video field configurations. The results show that the fusion system generally improves over the single-modality (audio- or video-only) systems, especially under complex noisy conditions.

Files

Zhou22g_interspeech.pdf
(pdf | 2.02 Mb)
- Embargo expired on 01-07-2023
License info not available