Unsupervised acoustic unit discovery by leveraging a language-independent subword discriminative feature representation

Conference Paper (2021)
Author(s)

Siyuan Feng (TU Delft - Multimedia Computing)

Piotr Żelasko (Johns Hopkins University)

Laureano Moro-Velázquez (Johns Hopkins University)

O.E. Scharenborg (TU Delft - Multimedia Computing)

Research Group
Multimedia Computing
Copyright
© 2021 S. Feng, Piotr Zelasko, Laureano Moro-Velázquez, O.E. Scharenborg
DOI related publication
https://doi.org/10.21437/Interspeech.2021-1664
Publication Year
2021
Language
English
Pages (from-to)
1534-1538
ISBN (electronic)
9781713836902
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper tackles acoustic unit discovery (AUD): automatically discovering phone-like acoustic units from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation, and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method for the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus show that our approach outperforms state-of-the-art AUD in both normalized mutual information (NMI) and F-score. The multilingual ASR improved upon the monolingual ASR in providing OOD phone labels and in estimating the phone boundaries. A comparison of our systems with and without knowledge of the ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that the current approach can benefit significantly from improved phone boundary estimation.
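As a rough illustration of the second stage described in the abstract, the sketch below mean-pools frame-level features within each speech segment to obtain fixed-dimension segment vectors, clusters them with segment-level k-means, and scores the resulting clusters against reference phone labels with NMI. All shapes, names, and the choice of mean pooling are illustrative assumptions for exposition, not the authors' exact implementation (the paper compares two segment representations, which may differ from plain mean pooling).

```python
# Illustrative sketch (not the authors' implementation): segment-level k-means
# over mean-pooled segment vectors, evaluated with NMI against phone labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def pool_segments(frame_feats, boundaries):
    """Turn variable-length segments into fixed-dimension vectors.

    frame_feats: (num_frames, feat_dim) frame-level representation.
    boundaries:  list of (start_frame, end_frame) pairs, one per segment.
    Mean pooling is one simple choice of fixed-dimension representation.
    """
    return np.stack([frame_feats[s:e].mean(axis=0) for s, e in boundaries])


# Toy data standing in for a real corpus: 1000 frames of 39-dim features,
# cut into 100 contiguous 10-frame segments with hypothetical phone labels.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(1000, 39))
boundaries = [(i * 10, (i + 1) * 10) for i in range(100)]
ref_phones = rng.integers(0, 30, size=100)

segment_vecs = pool_segments(frame_feats, boundaries)

# Segment-level k-means: each cluster is treated as one discovered acoustic unit.
kmeans = KMeans(n_clusters=30, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(segment_vecs)

# NMI between discovered units and reference phone labels (higher is better).
print("NMI:", normalized_mutual_info_score(ref_phones, cluster_ids))
```

In practice the segment boundaries would come either from ground-truth annotations or from an automatic phone boundary estimator; the 16% NMI gap reported in the abstract reflects the difference between those two settings.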

Files

Feng21_interspeech.pdf
(pdf | 0.273 MB)