How phonotactics affect multilingual and zero-shot ASR performance

Conference Paper (2021)

Author(s)

Siyuan Feng (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Piotr Żelasko (Johns Hopkins University)

Laureano Moro-Velázquez (Johns Hopkins University)

Ali Abavisani (University of Illinois at Urbana Champaign)

Mark Hasegawa-Johnson (University of Illinois at Urbana Champaign)

Odette Scharenborg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Najim Dehak (Johns Hopkins University)

Research Group

Multimedia Computing

Automatic Speech Recognition, Multilingual, Phonotactics, Zero-shot learning

DOI related publication

https://doi.org/10.1109/ICASSP39728.2021.9414478 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:c035edc6-688f-4d16-aa77-45351266dba2

Publication Year

2021

Language

English

Article number

9414478

Pages (from-to)

7238-7242

ISBN (print)

978-1-7281-7606-2

ISBN (electronic)

978-1-7281-7605-5

Event

ICASSP 2021 (2021-06-06 - 2021-06-11), Virtual Conference/Toronto, Canada

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and that retaining only the target language’s phonotactic data in LM training is preferable.
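
The notion of a "mismatch in phonotactics" can be illustrated with a small sketch. The code below is not the system described in the paper; it is a toy add-one-smoothed phone bigram model, written under the assumption that a phonotactic LM can be approximated by n-gram statistics over phone symbols. The two "languages", their phone sequences, and the function names `train_bigram` and `perplexity` are all hypothetical and invented for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def train_bigram(sequences, vocab):
    """Count phone bigrams, padding each sequence with boundary markers."""
    bigrams, contexts = Counter(), Counter()
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    # Predictable symbols are the phones plus the end marker.
    return bigrams, contexts, len(vocab) + 1


def perplexity(model, sequences):
    """Per-symbol perplexity under an add-one-smoothed bigram model."""
    bigrams, contexts, n_symbols = model
    log_prob, n_tokens = 0.0, 0
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + n_symbols)
            log_prob += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)


# Hypothetical toy data: language A allows only CV syllables,
# language B allows initial consonant clusters and final consonants.
vocab = {"p", "t", "k", "a", "s"}
lang_a_train = [list("pata"), list("taka"), list("kapa")]
lang_b_train = [list("stap"), list("spat"), list("skat")]
lang_a_test = [list("tapa")]

lm_a = train_bigram(lang_a_train, vocab)  # matched phonotactics
lm_b = train_bigram(lang_b_train, vocab)  # mismatched phonotactics

ppl_matched = perplexity(lm_a, lang_a_test)
ppl_mismatched = perplexity(lm_b, lang_a_test)
print(f"matched LM perplexity:    {ppl_matched:.2f}")
print(f"mismatched LM perplexity: {ppl_mismatched:.2f}")
```

In this toy setting the LM trained on the same language assigns lower perplexity to the held-out sequence than the LM trained on the other language, even though both share the same phone inventory. That is the kind of effect a separate LM in a hybrid system makes it possible to isolate from acoustic mismatch.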

Files

ICASSP2021_discophone.pdf

(pdf | 0.447 Mb)

License info not available