The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results
Hang Chen (University of Science and Technology of China)
Hengshun Zhou (University of Science and Technology of China)
Jun Du (University of Science and Technology of China)
Chin-Hui Lee (Georgia Institute of Technology)
Jingdong Chen (Northwestern Polytechnical University)
Shinji Watanabe (Carnegie Mellon University)
Sabato Marco Siniscalchi (University of Enna Kore, Georgia Institute of Technology)
Odette Scharenborg (TU Delft - Multimedia Computing)
Di-Yuan Liu (iFlytek)
and other authors (external organisations)
Abstract
In this paper, we discuss the rationale of the Multimodal Information based Speech Processing (MISP) Challenge and provide a detailed description of the recorded data, the two evaluation tasks and their corresponding baselines, followed by a summary of the submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing tasks in different scenarios by introducing information from an additional modality (e.g., video or text), which will hopefully lead to better environmental and speaker robustness in realistic applications. In the first MISP Challenge, two benchmark datasets recorded in a real home TV room, together with two reproducible open-source baseline systems, have been released to promote research in audio-visual wake word spotting (AVWWS) and audio-visual speech recognition (AVSR). To our knowledge, MISP is the first open evaluation challenge to tackle real-world issues of AVWWS and AVSR in the home TV scenario.