End-to-end language diarization for bilingual code-switching speech

Conference Paper (2021)
Author(s)

Hexin Liu (Nanyang Technological University)

Leibny Paola Garcia Perera (Johns Hopkins University)

Xinyi Zhang (Nanyang Technological University)

J. Dauwels (TU Delft - Signal Processing Systems)

Andy Khong (Nanyang Technological University)

Sanjeev Khudanpur (Johns Hopkins University)

Suzy J. Styles (Nanyang Technological University)

Research Group
Signal Processing Systems
Copyright
© 2021 Hexin Liu, Leibny Paola Garcia Perera, Xinyi Zhang, J.H.G. Dauwels, Andy W.H. Khong, Sanjeev Khudanpur, Suzy J. Styles
DOI related publication
https://doi.org/10.21437/Interspeech.2021-82
Publication Year
2021
Language
English
Bibliographical Note
Green Open Access added to the TU Delft Institutional Repository as part of the Taverne project ('You share, we take care!', https://www.openaccess.nl/en/you-share-we-take-care). Otherwise, as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses Dutch legislation to make this work public.
Pages (from-to)
866-870
ISBN (electronic)
9781713836902
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We propose two end-to-end neural configurations for language diarization on bilingual code-switching speech. The first, a BLSTM-E2E architecture, includes a set of stacked bidirectional LSTMs to compute embeddings and incorporates a deep clustering loss to enforce grouping of embeddings belonging to the same language class. The second, an XSA-E2E architecture, is based on an x-vector model followed by a self-attention encoder. The former encodes frame-level features into segment-level embeddings, while the latter considers all those embeddings to generate a sequence of segment-level language labels. We evaluated the proposed methods on the dataset from shared task B of WSTCSMC 2020 and on our simulated data derived from the SEAME dataset. Experimental results show that our proposed XSA-E2E architecture achieved a relative improvement of 12.1% in equal error rate and a 7.4% relative improvement in accuracy compared with the baseline algorithm on the WSTCSMC 2020 dataset. Our proposed XSA-E2E architecture achieved an accuracy of 89.84%, against a baseline of 85.60%, on the simulated data derived from the SEAME dataset.
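The XSA-E2E pipeline described above (frame-level features → segment-level x-vector embeddings → self-attention across all segment embeddings → per-segment language labels) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the TDNN layer sizes, attention hyperparameters, feature dimension, and two-language output below are illustrative assumptions, shown only to make the data flow concrete.

```python
# Minimal sketch of an XSA-E2E-style model (hypothetical sizes, not the
# paper's exact configuration): an x-vector-like TDNN front end encodes each
# segment's frames into one embedding; a Transformer self-attention encoder
# then attends across all segment embeddings to label each segment's language.
import torch
import torch.nn as nn


class XVectorEncoder(nn.Module):
    """Encodes frame-level features of one segment into a segment embedding."""

    def __init__(self, feat_dim=23, emb_dim=256):
        super().__init__()
        # TDNN layers realized as 1-D convolutions over time (assumed sizes).
        self.tdnn = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
        )
        # Statistics pooling (mean + std over frames) -> segment embedding.
        self.proj = nn.Linear(2 * 512, emb_dim)

    def forward(self, x):                     # x: (batch, frames, feat_dim)
        h = self.tdnn(x.transpose(1, 2))      # (batch, 512, frames')
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.proj(stats)               # (batch, emb_dim)


class XSAE2E(nn.Module):
    """Segment embeddings -> self-attention encoder -> per-segment labels."""

    def __init__(self, feat_dim=23, emb_dim=256, n_langs=2, n_layers=4):
        super().__init__()
        self.xvector = XVectorEncoder(feat_dim, emb_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=4, dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(emb_dim, n_langs)

    def forward(self, segments):  # segments: (batch, n_segs, frames, feat_dim)
        b, s, t, f = segments.shape
        emb = self.xvector(segments.reshape(b * s, t, f)).reshape(b, s, -1)
        ctx = self.encoder(emb)               # attends across all segments
        return self.classifier(ctx)           # (batch, n_segs, n_langs) logits


# Usage: 2 utterances, 10 segments each, 100 frames of 23-dim features.
model = XSAE2E()
logits = model(torch.randn(2, 10, 100, 23))
print(logits.shape)  # torch.Size([2, 10, 2])
```

The key design point the sketch captures is that classification is not done per segment in isolation: the self-attention encoder sees the whole sequence of segment embeddings, so each language decision can exploit context from neighboring segments, which matters for code-switching utterances.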

Files

Liu21d_interspeech_1.pdf
(PDF | 0.654 MB)
- Embargo expired in 01-05-2022
License info not available