Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling

This study addresses unsupervised subword modeling, i.e., learning feature representations that can distinguish the subword units of a language. The proposed approach adopts a two-stage bottleneck feature (BNF) learning framework, consisting of autoregressive predictive coding (APC) as a front-end and a DNN-BNF model as a back-end. APC pretrained features serve as input features to the DNN-BNF model. A language-mismatched ASR system is used to provide cross-lingual phone labels for DNN-BNF model training. Finally, BNFs are extracted as the subword-discriminative feature representation. A second aim of this work is to investigate the robustness of our approach to different amounts of training data. The results on the Libri-light and ZeroSpeech 2017 databases show that APC is effective for front-end feature pretraining. Our whole system outperforms the state of the art on both databases. Cross-lingual phone labels for English data generated by a Dutch ASR outperform those generated by a Mandarin ASR, possibly linked to the larger similarity of Dutch, compared to Mandarin, with English. Our system is less sensitive to the amount of training data when more than 50 hours are available. APC pretraining reduces the amount of needed training material from over 5,000 hours to around 200 hours with little performance degradation.


Introduction
Training a DNN acoustic model (AM) for a high-performance automatic speech recognition (ASR) system requires a huge amount of speech data paired with transcriptions. Many languages in the world have very limited or even no transcribed data [1]. Conventional supervised acoustic modeling techniques are thus problematic or even not applicable to these languages.
Unsupervised acoustic modeling (UAM) refers to the task of modeling the basic acoustic units of a language with only untranscribed speech [2][3][4][5][6][7]. An important task in UAM is to learn frame-level feature representations that can distinguish the subword units of the language for which no transcriptions are available, i.e., the target language, and that are robust to non-linguistic factors, such as speaker change [1,8]. This problem is referred to as unsupervised subword modeling, and is the focus of this study. It is essentially a feature representation learning problem.
There have been many interesting attempts at unsupervised subword modeling [2,3,6,9,10,11,12]. One research strand uses purely unsupervised learning techniques [2,3,9]. For instance, Chen et al. [2] proposed a Dirichlet process Gaussian mixture model (DPGMM) posteriorgram approach, which performed the best in ZeroSpeech 2015 [8]. Heck et al. extended this approach by applying unsupervised speaker adaptation, which performed the best in ZeroSpeech 2017 [3]. In a recent study [13], a two-stage bottleneck feature (BNF) learning framework was proposed. The first stage, i.e., the front-end, used the factorized hierarchical variational autoencoder (FHVAE) [14] to learn speaker-invariant features. The second stage, the back-end, consisted of a DNN-BNF model [15], which used the FHVAE pretrained features as input features and generated BNFs as the desired subword-discriminative acoustic feature representations. In the case of unsupervised acoustic modeling, no frame labels are available for DNN-BNF model training. In [13], DPGMM was adopted as a building block of the back-end to generate pseudo-phone labels for the speech frames. In another recent study [9], the vector quantized VAE (VQ-VAE) [16] was applied to directly learn the desired feature representation without a back-end model such as the DNN-BNF, achieving performance comparable to the state of the art [3].
In another research strand, frame-level feature representations that can distinguish subword units in the target language are created using a cross-lingual knowledge transfer approach [10,11]. Here, out-of-domain (OOD) mismatched language resources are used to train DNN AMs which are further used to extract phone posteriorgrams or BNFs of the target speech. The two research strands mentioned above can also be combined. For instance, [11] proposed to apply the DNN-BNF model, and utilized unsupervised DPGMM and OOD ASR systems to generate two types of frame labels for multi-task DNN-BNF learning. The two label types correspond to the two research strands respectively. The results showed the complementarity of the two label types in unsupervised subword modeling.
The present study adopts a two-stage BNF learning framework similar to [13], and aims at combining unsupervised learning techniques, specifically autoregressive predictive coding (APC) as a front-end, with cross-lingual knowledge transfer in the back-end. Recently, APC has been shown [17] to learn speech feature representations that are beneficial to various downstream tasks, and outperform other effective unsupervised methods such as contrastive predictive coding (CPC) [18] in ASR, speech translation and speaker verification [19]. APC preserves phonetic (subword) and speaker information from the original speech signal, while the two information types are more separable. This makes APC a possibly interesting method for unsupervised subword modeling. In this paper, we investigate the effectiveness of APC in this task for the first time.
In the second stage, a DNN-BNF back-end is trained, using the APC pretrained features as input features. Frame labels required for DNN-BNF model training are obtained using an OOD ASR system as was done in [11]. By doing so, cross-lingual phonetic knowledge is exploited. Two OOD ASR systems trained on different OOD languages are employed for comparison, in order to study the effect of target and OOD language similarity on the performance of the proposed approach.
For low-resource languages for which transcribed data are absent, even unlabeled speech can be costly to collect. The robustness of unsupervised subword modeling methods against limited amounts of training material is therefore an important topic, which has however received little attention in the literature so far. The second aim of this work is therefore to systematically investigate the robustness of the proposed approach to different amounts of training data. Specifically, we varied the amount of training data from 10 hours to over 500 hours.

Figure 1: General framework of the proposed approach to unsupervised subword modeling.

Proposed approach
The general framework of our proposed approach is illustrated in Figure 1. Given untranscribed speech data of a target language, an APC model is pretrained in the front-end. Next, an OOD ASR system trained on a language different from the target language assigns a phone label to every frame of the target language's speech data through decoding. The pretrained features created by the APC model and the cross-lingual phone labels created by the OOD ASR are then used to train a DNN-BNF model in the back-end, from which BNFs are extracted as the subword-discriminative representation in the final step. Front-end APC pretraining will be compared with an FHVAE approach [14] which was used in related previous work [13]. The whole pipeline of our approach will be compared with a system consisting of only the back-end DNN-BNF model, and with a CPC approach [18] applied to the same task [20]. Moreover, two different languages will be used to train two different OOD ASR systems for comparison.

APC pretraining
In the task considered here, previously adopted feature learning methods usually aim at suppressing speaker variation, such as FHVAE [13] and speaker adaptation [3]. In contrast, APC aims at learning a representation that keeps information from speech, while making phonetic information more separable from speaker information. The learned representation is therefore less likely to lose phonetic information than representations learned by the methods in [3,13].
Let us assume a set of unlabeled speech frames {x_1, x_2, ..., x_T} for training, where T is the total number of frames. At each time step t, the encoder of the APC model, Enc(·), reads a feature vector x_t as input and outputs a feature vector \hat{x}_t (of the same dimension as x_t) based on all the previous inputs:

  \hat{x}_t = Enc(x_1, x_2, ..., x_t).    (1)

The goal of APC is to let \hat{x}_t be as close as possible to x_{t+n}, where n is a pre-defined constant positive integer, denoted as the prediction step. The loss function during APC training is defined as:

  Loss = \sum_{t=1}^{T-n} |\hat{x}_t - x_{t+n}|.

Intuitively, increasing n encourages the encoder to capture contextual dependencies in speech, while a small n focuses more on local smoothness.
Here, the encoder of APC, Enc(·), is realized by a long short-term memory (LSTM) [21] RNN. Let L denote the number of LSTM layers, and let h^l_t be the output of the l-th LSTM layer at time t (with h^0_t = x_t). Equation (1) is then formulated as

  h^l_t = LSTM(h^{l-1}_1, ..., h^{l-1}_t),  l = 1, ..., L,
  \hat{x}_t = W h^L_t,

where W is a trainable projection matrix. The equations that form LSTM(·) can be found in [22]. After APC training, the output of the top hidden layer h^L is extracted as the learned acoustic representation, and is henceforth referred to as the APC feature. Although in principle h^l of any layer l could be used as the learned representation, we follow [17] in using the output of the top layer, as they showed that this gave the best results in phone classification tasks.
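To make the training objective above concrete, the n-step prediction loss can be sketched in plain numpy. The LSTM encoder is replaced here by a trivial copy-last-frame predictor purely for illustration; the function names and the toy signal are our own and not from the paper:

```python
import numpy as np

def apc_loss(features, predict, n):
    """L1 loss between the prediction at time t and the true frame n
    steps ahead: sum_{t=1}^{T-n} |x_hat_t - x_{t+n}|."""
    T = len(features)
    # Prediction at step t may only look at the history x_1 .. x_t.
    preds = np.array([predict(features[: t + 1]) for t in range(T - n)])
    targets = features[n:]          # x_{t+n} for t = 0 .. T-n-1
    return np.abs(preds - targets).sum()

# Toy stand-in for Enc(.): predict the future frame as a copy of the
# current one (the paper uses a multi-layer LSTM plus a projection W).
copy_last = lambda history: history[-1]

T, d, n = 8, 3, 2
x = np.arange(T * d, dtype=float).reshape(T, d)   # a linear ramp signal
loss = apc_loss(x, copy_last, n)
```

Because the toy signal is a linear ramp, every prediction is off by a constant, so the loss grows linearly with the prediction step n, illustrating why larger n forces the encoder to model more than local smoothness.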

Cross-lingual phone-aware DNN-BNF
As shown in Figure 1, the DNN-BNF back-end is a DNN with a bottleneck layer in the middle [23]. To obtain cross-lingual phone labels, the OOD ASR is used to decode the target speech utterances into lattices and find the best path for every utterance. Afterwards, each speech frame is assigned a triphone HMM state modeled by the OOD ASR. These state labels provide a phonetic representation for the target speech from a cross-lingual perspective.
After obtaining the triphone HMM state labels as cross-lingual phone labels, the DNN-BNF is trained using the pretrained APC features and the cross-lingual phone labels in a supervised manner [24], and used to extract BNFs as the desired subword-discriminative feature representation.
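The expansion of a best-path alignment into per-frame training labels can be sketched as follows. The (state id, duration) pair format and all numbers are hypothetical stand-ins for what a Kaldi-style decoder would produce:

```python
def expand_alignment(best_path):
    """Expand a best-path alignment, given as (hmm_state_id, n_frames)
    pairs, into one cross-lingual state label per speech frame."""
    labels = []
    for state_id, n_frames in best_path:
        labels.extend([state_id] * n_frames)
    return labels

# Hypothetical best path over 7 frames of one target-language utterance:
path = [(412, 3), (87, 2), (901, 2)]
frame_labels = expand_alignment(path)
# frame_labels -> [412, 412, 412, 87, 87, 901, 901]
```

Each frame of the target speech thus receives exactly one OOD triphone state label, which serves as the supervision target for the DNN-BNF.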

Databases and evaluation metric
English is chosen as the target language, while Dutch and Mandarin are chosen as the two OOD languages. Training data for APC pretraining and DNN-BNF model training are taken from Libri-light [20], a newly published English database to support unsupervised subword modeling. The unlab-600 and unlab-6K sets from Libri-light are adopted. Unlab-600 is used in both APC pretraining and DNN-BNF model training, while unlab-6K is used only in DNN-BNF model training. Unlab-600 consists of 526 hours of speech excluding silence. Additionally, we randomly select four subsets of utterances from unlab-600 to investigate the robustness of our approach to different amounts of training material. These subsets consist of 900 (i.e., 13 hours), 3.6K (52 hours), 7.2K (104 hours), and 14.4K (209 hours) utterances. The unlab-6K set consists of 5,273 hours of speech excluding silence. Details of the training sets are listed in Table 1.
The Dutch and Mandarin corpora used for training the two OOD ASR systems are CGN [25] and Aidatatang 200zh [26], respectively. The CGN training and test data partition follows [27]. Its training data contains 483 hours of speech, covering speaking styles including conversational speech, read speech, and broadcast news. Aidatatang 200zh is a read speech corpus. Its training data contains 140 hours of speech.
Evaluation data are taken from Libri-light and ZeroSpeech 2017 [1]. The Libri-light evaluation sets consist of dev-clean, dev-other, test-clean and test-other, with *-clean having higher recording quality and accents closer to US English than *-other [28]. They are used to evaluate the effectiveness of both the front-end pretrained features and the BNFs learned by the back-end.
Evaluation on ZeroSpeech 2017 aims to better compare our approach with previous research in this area. The English evaluation data from ZeroSpeech 2017 are used to evaluate the effectiveness of the BNFs learned by the back-end. These data are organized into subsets of differing lengths (1s, 10s & 120s) [1].
The created BNFs, as well as the APC pretrained features, are evaluated in terms of ABX subword discriminability [1]. In the ABX task, A, B and X are three speech segments, and x and y are two different phonemes, with A ∈ x, B ∈ y, and X ∈ x or X ∈ y. Following [1], given a pre-defined distance measure d, an error occurs if d(A, X) > d(B, X) when X ∈ x, or d(A, X) < d(B, X) when X ∈ y. Dynamic time warping is chosen as the distance measure. Segments A and B belong to the same speaker. ABX error rates are evaluated separately for the within-speaker and across-speaker conditions, depending on whether X and A belong to the same speaker.
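The ABX decision for a single (A, B, X) triple can be sketched as below, with a textbook DTW over frame sequences. The Euclidean frame-level distance and the function names are our own illustrative choices; the exact frame-level distance used by the ZeroSpeech evaluation toolkit may differ:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two frame sequences
    (2-D arrays of shape [frames, dims]), using Euclidean frame cost."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]

def abx_error(A, B, X):
    """1 if X (same phoneme as A) is wrongly judged closer to B, else 0."""
    return int(dtw_distance(A, X) > dtw_distance(B, X))
```

Averaging `abx_error` over all triples of a given condition (within- or across-speaker) yields the reported ABX error rate.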

Front-end
The APC model is implemented as a multi-layer LSTM network. Residual connections are made between consecutive layers. Each LSTM layer has 100 dimensions. Unless specified explicitly, the number of LSTM layers is 3. For each training data amount setting, the prediction step n (see Section 2.1) is picked from {1, 2, 3, 4, 5} as the value that gives the best ABX performance. Our preliminary experiments showed that increasing n beyond 5 leads to rapid degradation in ABX error rate. The input features to APC are 13-dimension MFCCs with cepstral mean normalization (CMN) at the speaker level. The model is trained with the open-source tool by [17] for 100 epochs with the Adam optimizer [29], an initial learning rate of 10^-4, and a batch size of 32. After training, the top LSTM layer's output is extracted as the APC feature representation.
The performance of front-end APC pretraining is compared against FHVAE [14], which was used in related previous work [13]. The latent representation z1 of FHVAE is known to preserve linguistic content while suppressing speaker variation [14], and is compared with the APC feature representation. The model architecture of FHVAE and its training procedure follow those in [13]. The FHVAE models are trained using an open-source tool [14], and take the same input features and training data (i.e., Libri-light) as the APC models. After training, the FHVAE encoder's output z1 is extracted.

OOD ASR systems
We trained two OOD ASR systems, i.e., a Dutch ASR and a Mandarin ASR. The OOD ASR systems use a chain time-delay NN (TDNN) AM [30] trained using Kaldi [31], containing 7 layers. The TDNN is trained based on the lattice-free maximum mutual information (LF-MMI) criterion [30]. For Dutch, the input features consist of 40-dimension high-resolution (HR) MFCCs. For Mandarin, the input features consist of HR MFCCs appended by pitch features [32]. Frame labels required for TDNN training are obtained by forced alignment with a GMM-HMM AM trained beforehand. For both systems, a trigram LM is trained using the training data transcriptions.
The Dutch ASR obtained a word error rate (WER) of 8.98% on the CGN broadcast test set. (This WER could be improved upon by integrating an RNN LM. However, as Dutch ASR performance is not the focus of this study, an RNN LM is not applied.) The Mandarin ASR obtained a character error rate (CER) of 6.37% on the Aidatatang 200zh test set. The two ASR systems are used to generate cross-lingual phone labels for Libri-light training speech frames.

DNN-BNF setup
Two DNN-BNF models are trained, one taking the Dutch cross-lingual phone labels as training labels and one taking the Mandarin phone labels as training labels.
The DNN-BNF consists of 7 feed-forward layers (FFLs). Each layer has 450 dimensions except the 40-dimension bottleneck layer, which is located below the top FFL. The DNN-BNF is a chain model [30] trained based on the LF-MMI criterion. The inputs to the DNN-BNF are the APC feature with its neighboring (−3 to +3) frames. After DNN-BNF training, 40-dimension BNFs are extracted as the learned subword-discriminative representation and evaluated with the ABX task.
For the purpose of comparison, two more DNN-BNF models are trained using the 40-dimension HR MFCC with its neighboring (−3 to +3) frames as input features. One model takes the Dutch labels and the other takes the Mandarin labels. Other training and model settings are unchanged. After training, BNFs are extracted and also evaluated with the ABX task.
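The input splicing for the DNN-BNF, i.e., stacking each frame with its −3 to +3 neighbors, can be sketched in numpy. Padding the edges by repeating the first/last frame is our own assumption, as the paper does not specify the edge handling:

```python
import numpy as np

def splice(feats, left=3, right=3):
    """Stack each frame with its left/right neighbors.
    feats: array of shape [T, d]; returns [T, (left+1+right)*d].
    Edge frames are padded by repeating the first/last frame (assumed)."""
    T = feats.shape[0]
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + right + 1].reshape(-1)
                     for t in range(T)])

feats = np.random.randn(50, 100)   # 50 frames of 100-dim APC features
inputs = splice(feats)             # shape (50, 700): 7 spliced frames each
```

The same splicing applies when 40-dimension HR MFCCs are used as input, yielding 280-dimension spliced vectors.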

Effectiveness of APC features
In this subsection, the APC features and the FHVAE features from the front-end (z1) are directly evaluated using the ABX task, without being modeled by the DNN-BNF back-end. ABX error rates (%) of the APC and FHVAE features with respect to different amounts of training data (in hours) are shown in Figure 2. The ABX results in this figure are averaged over the 4 evaluation sets in Libri-light. The official MFCC baseline [20] is also shown. It can be observed that both the APC features and the FHVAE features outperform the MFCC features. The APC features are consistently superior to the FHVAE features in both the across- and the within-speaker conditions, irrespective of the amount of training data. Figure 2 (left) indicates that the APC features are more robust to speaker variation than the FHVAE features, even though the APC model does not explicitly suppress speaker variation as FHVAE does.

Effectiveness of BNF representation
In this subsection, all models are trained with unlab-600 (526 hours). ABX error rates (%) of BNFs extracted by the back-end DNN-BNF model are listed in Table 2. The second and third columns denote the input feature types and frame labels used for training the DNN-BNF models. 'Du' and 'Ma' stand for Dutch and Mandarin. Two front-end features, i.e., APC and CPC (from [20]), are also listed as references. From this table, it is observed that: (1) the DNN-BNF trained with APC features performs better than that trained with MFCC features on all the evaluation sets. This demonstrates the effectiveness of front-end APC pretraining in our proposed two-stage system framework.
(2) The BNFs obtained from the back-end DNN-BNF model outperform the APC features from the front-end. In other words, the results show that back-end DNN-BNF modeling with cross-lingual phone labels outperforms front-end pretrained features for unsupervised subword modeling, similar to what has been observed by [10,11]. BNF also performs better than the CPC feature [20]. Note that CPC does not require OOD resources during training while BNF in this study does.
(3) The performance achieved by adopting Dutch labels in DNN-BNF model training is slightly better than that by adopting Mandarin labels. This can possibly be explained by the similarity between the OOD language and target in-domain language, i.e., Dutch and English, respectively, which are both West Germanic languages, while Mandarin is not. Although one could possibly attribute the superiority of adopting Dutch labels over Mandarin labels to the larger amount of training data for Dutch (483 hours) than for Mandarin (140 hours), this is not a likely explanation because both models achieved fairly similar results on their respective in-domain test sets (in Section 3.3.1). Table 2 also shows CPC outperforms APC. We plan to replace the front-end APC with CPC and study its efficacy in combination with the back-end DNN-BNF model in the future.

Effect of amount of training data
ABX error rates (%) of BNFs extracted by DNN-BNF models with respect to different amounts of training data (in hours) are illustrated in Figure 3. The results are averaged over the 4 evaluation sets in Libri-light. Unlab-6K (5,273 hours) is only adopted in training DNN-BNF models with MFCC input features. For the models trained with APC features as input features, the data amount for APC pretraining and DNN-BNF model training is the same in each run. From Figure 3, it can clearly be seen that performance improves as more training data becomes available, with the largest improvement when the training data increases from 13 hours to 52 hours, and smaller improvements for any additional training material. Secondly, across the different data amounts, the DNN-BNF models trained with APC input features are almost consistently better than those trained with MFCC input features. Interestingly, with Dutch labels, the model that uses APC features and is trained with 209 hours of data achieves an across-speaker error rate (8.78%) similar to that of the model trained with MFCCs on 5,273 hours of data (8.70%). This implies that APC pretraining "saves" around 5,000 hours (i.e., 96%) of training data, making APC pretraining highly appealing in low-resource speech modeling. The effect of pretraining on the needed amount of training data is even larger when Mandarin labels are used (saving over 99% of the training data).

ZeroSpeech 2017 results
We also evaluated the performance of our approach on the ZeroSpeech 2017 English evaluation sets. The results are shown in Table 3, which also includes the official topline [1] and the best-performing system (using OOD data) [10]. Note that, unlike our approach, these two systems employed English labeled data. The total amount of labeled training data used in [10] is 1,327 hours (including 80 hours of English data). In this table, "Proposed-Du/-Ma" denotes our proposed approach adopting Dutch or Mandarin labels, respectively. Interestingly, using Dutch labels, our system trained with 526 hours of data outperforms the topline and the system of [10], and is comparable to the two reference systems when trained with only 104 hours of data. Table 3 also shows that the proposed approach performs better with Dutch labels than with Mandarin labels, which is consistent with the observations in Section 4.2.

Conclusions
This study addresses unsupervised subword modeling, and proposes a two-stage system that consists of APC pretraining and cross-lingual phone-aware DNN-BNF modeling. Experimental results on the Libri-light and ZeroSpeech 2017 databases demonstrate the effectiveness of APC for front-end feature pretraining; it surpasses a previously adopted FHVAE approach. Our whole system outperforms the state of the art on both databases. Cross-lingual phone labeling of English data by a Dutch ASR is slightly better than by a Mandarin ASR, possibly linked to the larger similarity of Dutch, compared to Mandarin, to English. The proposed approach benefits from increasing amounts of training data, and is less sensitive to the data amount once the training data exceeds 50 hours. When using APC pretraining, 4% of the training material achieves performance similar to using the full training set without APC pretraining.