The effectiveness of self-supervised representation learning in zero-resource subword modeling

Abstract

For a language with no transcribed speech available (the zero-resource scenario), conventional acoustic modeling algorithms are not applicable. Recently, zero-resource acoustic modeling has gained much interest. One research problem is unsupervised subword modeling (USM), i.e., learning a feature representation that can distinguish subword units and is robust to speaker variation. Previous studies showed that self-supervised learning (SSL) has the potential to separate speaker and phonetic information in speech in an unsupervised manner, which is highly desirable in USM. This paper compares two representative SSL algorithms, namely contrastive predictive coding (CPC) and autoregressive predictive coding (APC), as front-end methods of a recently proposed, state-of-the-art two-stage approach, in which the learned representation serves as input to a back-end cross-lingual DNN. Experiments show that the bottleneck features extracted by the back-end achieve state-of-the-art performance in a subword ABX task on the Libri-light and ZeroSpeech databases. In general, CPC is more effective than APC as the front-end in our approach, a finding that holds irrespective of the choice of out-domain language in the back-end cross-lingual DNN and of the amount of training data. With very limited training data, APC is found to be similarly or more effective than CPC when the test data consist of long utterances.
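
For illustration, the sketch below shows how such a two-stage pipeline could be wired together in PyTorch: a CPC-style front-end (convolutional encoder followed by an autoregressive context network) produces frame-level representations, and a back-end DNN with a bottleneck layer yields the features that would be evaluated in the ABX task. All module names, layer sizes, and the 39-dimensional input are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a two-stage USM pipeline: SSL front-end + bottleneck back-end.
# Assumed/illustrative: layer sizes, 39-dim acoustic input, number of cross-lingual targets.
import torch
import torch.nn as nn

class CPCFrontEnd(nn.Module):
    """CPC-style front-end: convolutional encoder plus autoregressive
    context network; the context vectors are the learned representation."""
    def __init__(self, in_dim=39, enc_dim=256, ctx_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_dim, enc_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.context = nn.GRU(enc_dim, ctx_dim, batch_first=True)

    def forward(self, x):                 # x: (batch, time, in_dim)
        z = self.encoder(x.transpose(1, 2)).transpose(1, 2)
        c, _ = self.context(z)            # c: (batch, time, ctx_dim)
        return c

class BottleneckBackEnd(nn.Module):
    """Back-end DNN trained on out-domain (cross-lingual) targets; the
    bottleneck-layer activations are the final subword-discriminative features."""
    def __init__(self, in_dim=256, bn_dim=40, n_targets=1000):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.bottleneck = nn.Linear(512, bn_dim)
        self.output = nn.Linear(bn_dim, n_targets)

    def forward(self, c):
        h = self.hidden(c)
        bn = self.bottleneck(h)           # bottleneck features (BNFs)
        return self.output(bn), bn

# Usage: front-end representations feed the back-end; BNFs go to ABX evaluation.
front, back = CPCFrontEnd(), BottleneckBackEnd()
feats = torch.randn(2, 100, 39)           # dummy batch of acoustic frames
with torch.no_grad():
    logits, bnf = back(front(feats))      # bnf: (2, 100, 40) bottleneck features
```

The same back-end could consume APC context vectors in place of the CPC ones, which is the comparison the paper reports.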