In this thesis, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) pre-trained speech models in terms of how well they capture articulatory feature (AF) information and how well that information predicts phone recognition performance in within-language and cross-language scenarios. Specifically, we compared the representations of three SSL pre-trained speech models: CPC, wav2vec 2.0, and HuBERT. First, frame-level AF probing tasks were implemented to measure the AF information captured by each speech representation. Subsequently, phone-level ASR systems were built on the same representations to measure their phone recognition performance.

Results showed that performance on the frame-level AF probing tasks and phone recognition accuracy were correlated: the probing tasks are a good predictor of phone recognition performance, underlining the importance of capturing AF information in speech representations. Compared to the conventional MFCC representation, all SSL pre-trained representations captured more AF information and achieved better phone recognition in both within-language and cross-language scenarios, with HuBERT performing best. In the within-language scenario, the SSL models achieved a maximum relative improvement of 34.4% over MFCC on the AF probing tasks, corresponding to the lowest phone error rate (PER) of 10.2%; in the cross-language scenario, the maximum relative improvement of 26.7% corresponded to the lowest PER of 23.0%.
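To illustrate the probing setup described above, the sketch below trains a linear frame-level AF classifier on top of a frozen SSL encoder. This is a minimal sketch under stated assumptions, not the thesis implementation: the Hugging Face transformers library, the facebook/wav2vec2-base checkpoint, the purely linear probe, and the NUM_AF_CLASSES placeholder are all illustrative choices; the actual AF label inventories and training data come from the thesis experiments.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Load a pre-trained wav2vec 2.0 encoder and freeze it, so the probe
# measures only what the representation already contains.
# (Checkpoint name is an assumption; the thesis also compares CPC and HuBERT.)
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Hypothetical number of classes for one articulatory feature
# (e.g. place of articulation); the real inventories are thesis-specific.
NUM_AF_CLASSES = 10

# Linear frame-level probe: one logit vector per encoder output frame.
probe = nn.Linear(encoder.config.hidden_size, NUM_AF_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def probe_step(waveform: torch.Tensor, frame_labels: torch.Tensor) -> float:
    """One training step: waveform is (batch, samples) at 16 kHz,
    frame_labels is (batch, frames) aligned to the encoder frame rate."""
    with torch.no_grad():
        feats = encoder(waveform).last_hidden_state  # (batch, frames, dim)
    logits = probe(feats)                            # (batch, frames, classes)
    loss = criterion(logits.flatten(0, 1), frame_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random data, just to show the expected shapes.
wav = torch.randn(2, 16000)  # one second of audio per batch item
n_frames = encoder(wav).last_hidden_state.shape[1]
labels = torch.randint(0, NUM_AF_CLASSES, (2, n_frames))
print(probe_step(wav, labels))
```

Keeping the encoder frozen is the key design point: because only the small linear layer is trained, probing accuracy reflects how linearly accessible the AF information is in the representation itself, which is what makes it a meaningful predictor of downstream phone recognition performance.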