Analyzing and comparing different self-supervised learning speech pre-trained models in the view of phonetics

Master Thesis (2022)
Author(s)

H. Ji (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Odette Scharenborg – Mentor (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Hang Ji
Publication Year
2022
Language
English
Graduation Date
28-06-2022
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering | Embedded Systems
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this thesis, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on their ability to capture articulatory feature (AF) information, and on how well that information predicts phoneme recognition performance in within-language and cross-language scenarios. Specifically, we compared speech representations from three SSL speech pre-trained models: CPC, wav2vec 2.0, and HuBERT. First, frame-level AF probing tasks were implemented to analyze the AF information captured by the different speech representations. Subsequently, phone-level ASR systems were implemented to analyze the phoneme recognition performance of these speech representations. Results showed that performance on the frame-level AF probing task and accuracy on the phoneme recognition task were correlated. Compared to the conventional speech representation MFCC, all SSL pre-trained speech representations captured more AF information and achieved better phoneme recognition performance in both within-language and cross-language scenarios, with HuBERT performing best. Moreover, the frame-level AF probing task is a good predictor of phoneme recognition performance, showing the importance of capturing AF information in speech representations. Compared with MFCC, in the within-language scenario, the performance of these SSL speech pre-trained models on AF probing tasks achieved a maximum relative increase of 34.4%, resulting in the lowest phone error rate (PER) of 10.2%. In the cross-language scenario, the maximum relative increase of 26.7% resulted in the lowest PER of 23.0%.
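The frame-level AF probing described above can be illustrated with a minimal sketch: a linear classifier is trained on frozen frame representations to predict an articulatory-feature label, and its accuracy measures how much AF information the representation encodes. In the thesis the features would come from CPC, wav2vec 2.0, or HuBERT; here synthetic features and labels stand in for them (an assumption for self-containment), and the probe is a plain multinomial logistic regression.

```python
import numpy as np

# Hedged sketch of a frame-level AF probe. Synthetic stand-ins: in the
# actual experiments, X would be frozen frames from an SSL model and y
# would be frame-aligned articulatory-feature labels (e.g. manner class).
rng = np.random.default_rng(0)
n_frames, dim, n_classes = 600, 32, 3
W_true = rng.normal(size=(dim, n_classes))      # hypothetical label rule
X = rng.normal(size=(n_frames, dim))            # "frozen" frame features
y = (X @ W_true).argmax(axis=1)                 # synthetic AF labels

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic-regression probe, full-batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)       # softmax probabilities
        W -= lr * X.T @ (p - onehot) / len(X)   # cross-entropy gradient
    return W

W = train_linear_probe(X, y, n_classes)
acc = float(((X @ W).argmax(axis=1) == y).mean())
print(f"frame-level probe accuracy: {acc:.2f}")
```

In the thesis setting, this probe accuracy per AF (computed on held-out frames) is the quantity compared across MFCC, CPC, wav2vec 2.0, and HuBERT representations, and is the predictor correlated with downstream PER.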

Files

HangJi_Thesis_Final.pdf
(pdf | 3.25 MB)
License info not available