The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results
Hang Chen (University of Science and Technology of China)
Hengshun Zhou (University of Science and Technology of China)
Jun Du (University of Science and Technology of China)
Chin-Hui Lee (Georgia Institute of Technology)
Jingdong Chen (Northwestern Polytechnical University)
Shinji Watanabe (Carnegie Mellon University)
Sabato Marco Siniscalchi (University of Enna Kore, Georgia Institute of Technology)
Odette Scharenborg (TU Delft - Multimedia Computing)
Di-Yuan Liu (iFlytek)
and other authors (external organisations)
Abstract
In this paper, we discuss the rationale of the Multimodal Information based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks and their corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), which will hopefully lead to better environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, have been released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.