The Multimodal Information Based Speech Processing (MISP) 2022 Challenge

Audio-Visual Diarization and Recognition

Conference Paper (2023)

Author(s)

Zhe Wang (University of Science and Technology of China)

Shilong Wu (University of Science and Technology of China)

Hang Chen (University of Science and Technology of China)

Mao-Kui He (University of Science and Technology of China)

Jun Du (University of Science and Technology of China)

Chin-Hui Lee (Georgia Institute of Technology)

Jingdong Chen (Northwestern Polytechnical University)

Shinji Watanabe (Carnegie Mellon University)

Sabato Marco Siniscalchi (University of Enna Kore, Georgia Institute of Technology)

Odette Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Diyuan Liu (iFlytek)

More Authors (External organisation)

Research Group

Multimedia Computing

Speech recognition, Multimodality, MISP challenge, Speaker diarization

DOI related publication

https://doi.org/10.1109/ICASSP49357.2023.10094836 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:3e54aaa0-46f8-4411-a5ca-351a314d73ce

Publication Year

2023

Language

English

ISBN (print)

978-1-7281-6328-4

ISBN (electronic)

978-1-7281-6327-7

Event

48th IEEE International Conference on Acoustics, Speech and Signal Processing 2023 (2023-06-04 - 2023-06-10), Rhodes Island, Greece

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which focuses on addressing "who spoke what when" using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.

Files

The_Multimodal_Information_Bas... (pdf)

(pdf | 1.19 Mb)

- Embargo expired in 05-11-2023

License info not available