The First Multimodal Information Based Speech Processing (MISP) Challenge

Data, Tasks, Baselines And Results

Conference Paper (2022)
Author(s)

Hang Chen (University of Science and Technology of China)

Hengshun Zhou (University of Science and Technology of China)

Jun Du (University of Science and Technology of China)

Chin-Hui Lee (Georgia Institute of Technology)

Jingdong Chen (Northwestern Polytechnical University)

Shinji Watanabe (Carnegie Mellon University)

Sabato Marco Siniscalchi (University of Enna Kore, Georgia Institute of Technology)

Odette Scharenborg (TU Delft - Multimedia Computing)

Di-Yuan Liu (iFlytek)

More Authors (External organisation)

Research Group
Multimedia Computing
DOI related publication
https://doi.org/10.1109/ICASSP43922.2022.9746683 Final published version
More Info
Publication Year
2022
Language
English
Article number
9746683
Pages (from-to)
9266-9270
ISBN (print)
978-1-6654-0541-6
ISBN (electronic)
978-1-6654-0540-9
Event
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022-05-23 - 2022-05-27), Singapore, Singapore
Downloads counter
561
Collections
Institutional Repository
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this paper we discuss the rationale of the Multimodal Information based Speech Processing (MISP) Challenge, and provide a detailed description of the data recorded, the two evaluation tasks and the corresponding baselines, followed by a summary of submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing tasks in different scenarios by introducing information about an additional modality (e.g., video or text), which will hopefully lead to better environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real-home TV room, together with two reproducible open-source baseline systems, have been released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.

Files

The_First_Multimodal_Informati... (pdf)
(pdf | 0.958 Mb)
- Embargo expired on 01-07-2023
License info not available