Audio-Visual Wake Word Spotting in MISP2021 Challenge

Dataset Release and Deep Analysis

Journal Article (2022)
Author(s)

Hengshun Zhou (University of Science and Technology of China)

Jun Du (University of Science and Technology of China)

Gongzhen Zou (University of Science and Technology of China)

Zhaoxu Nian (University of Science and Technology of China)

Chin-Hui Lee (Georgia Institute of Technology)

Sabato Marco Siniscalchi (Georgia Institute of Technology, University of Enna Kore)

Shinji Watanabe (Carnegie Mellon University)

Odette Scharenborg (TU Delft - Multimedia Computing)

Jingdong Chen (Northwestern Polytechnical University)

More Authors (External organisation)

Research Group
Multimedia Computing
DOI
https://doi.org/10.21437/Interspeech.2022-10650
Publication Year
2022
Language
English
Volume number
2022-September
Pages (from-to)
1111-1115
Event
23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 (2022-09-18 - 2022-09-22), Incheon, Republic of Korea
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database from the MISP2021 Challenge, which covers a range of scenarios of audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, creating a shared, publicly available database for WWS. The released database and code will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigate different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses was conducted to observe how visual information assists acoustic information under different audio and video field configurations. The results show that the fusion system generally improves over the single-modality (audio- or video-only) systems, especially under complex noisy conditions.

Files

Zhou22g_interspeech.pdf
(pdf | 2.02 Mb)
- Embargo expired on 01-07-2023
License info not available