Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis

Zhou, Hengshun; Du, Jun; Zou, Gongzhen; Nian, Zhaoxu; Lee, Chin Hui; Siniscalchi, Sabato Marco; Watanabe, Shinji; Scharenborg, O.E.; Chen, Jingdong

doi:10.21437/Interspeech.2022-10650

Title

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis

Author

Zhou, Hengshun (University of Science and Technology of China)
Du, Jun (University of Science and Technology of China)
Zou, Gongzhen (University of Science and Technology of China)
Nian, Zhaoxu (University of Science and Technology of China)
Lee, Chin Hui (Georgia Institute of Technology)
Siniscalchi, Sabato Marco (Georgia Institute of Technology; University of Enna Kore)
Watanabe, Shinji (Carnegie Mellon University)
Scharenborg, O.E. (TU Delft Multimedia Computing)
Chen, Jingdong (Northwestern Polytechnical University)

Date

2022

Abstract

In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database of the MISP2021 Challenge, which covers a range of scenarios with audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, to create a shared and publicly available database for WWS. The database and the code are released, which will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses was conducted to observe how visual information assists acoustic information under different audio and video field configurations. The results showed that the fusion system generally improves over the single-modality (audio- or video-only) systems, especially under complex noisy conditions.

Subject

analysis
audio-visual database
data augmentation
Wake word spotting

To reference this document use:

http://resolver.tudelft.nl/uuid:74fc0816-0901-423d-b917-365574bf24a3

DOI

https://doi.org/10.21437/Interspeech.2022-10650

Embargo date

2023-07-01

ISSN

2308-457X

Source

Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 1111-1115

Event

23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, 2022-09-18 → 2022-09-22, Incheon, Korea, Republic of

Bibliographical note

Green Open Access added to TU Delft Institutional Repository 'You share, we take care!' - Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Part of collection

Institutional Repository

Document type

journal article

Files

PDF

zhou22g_interspeech.pdf

2.02 MB
