Analyzing and comparing different self-supervised learning speech pre-trained models in the view of phonetics

Master Thesis (2022)
Author(s)

H. Ji (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Odette Scharenborg – Mentor (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Hang Ji
Publication Year
2022
Language
English
Graduation Date
28-06-2022
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering | Embedded Systems
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In this thesis, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on their ability to capture articulatory feature (AF) information, and on how well that information predicts phoneme recognition performance in within-language and cross-language scenarios. Specifically, we compared speech representations from three SSL speech pre-trained models: CPC, wav2vec 2.0, and HuBERT. First, frame-level AF probing tasks were implemented to analyze the AF information captured by the different speech representations. Subsequently, phone-level ASR systems were implemented to analyze the phoneme recognition performance of these speech representations. Results showed that performance on the frame-level AF probing task and accuracy on the phoneme recognition task were correlated. Compared to the conventional speech representation MFCC, all SSL pre-trained speech representations captured more AF information and achieved better phoneme recognition performance in both within-language and cross-language scenarios, with HuBERT performing best. Moreover, the frame-level AF probing task is a good predictor of phoneme recognition performance, showing the importance of capturing AF information in speech representations. Compared with MFCC, in the within-language scenario, the performance of these SSL speech pre-trained models on AF probing tasks achieved a maximum relative increase of 34.4%, resulting in the lowest phone error rate (PER) of 10.2%. In the cross-language scenario, the maximum relative increase of 26.7% resulted in the lowest PER of 23.0%.
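The frame-level AF probing described above can be illustrated with a minimal sketch: a linear classifier is trained on frozen frame representations to predict an articulatory-feature label, and its accuracy measures how much AF information the representation encodes. In the thesis the features would come from CPC, wav2vec 2.0, or HuBERT; here synthetic features and labels stand in for them (an assumption for self-containment), and the probe is a plain multinomial logistic regression.

```python
import numpy as np

# Hedged sketch of a frame-level AF probe. Synthetic stand-ins: in the
# actual experiments, X would be frozen frames from an SSL model and y
# would be frame-aligned articulatory-feature labels (e.g. manner class).
rng = np.random.default_rng(0)
n_frames, dim, n_classes = 600, 32, 3
W_true = rng.normal(size=(dim, n_classes))      # hypothetical label rule
X = rng.normal(size=(n_frames, dim))            # "frozen" frame features
y = (X @ W_true).argmax(axis=1)                 # synthetic AF labels

def train_linear_probe(X, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic-regression probe, full-batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)       # softmax probabilities
        W -= lr * X.T @ (p - onehot) / len(X)   # cross-entropy gradient
    return W

W = train_linear_probe(X, y, n_classes)
acc = float(((X @ W).argmax(axis=1) == y).mean())
print(f"frame-level probe accuracy: {acc:.2f}")
```

In the thesis setting, this probe accuracy per AF (computed on held-out frames) is the quantity compared across MFCC, CPC, wav2vec 2.0, and HuBERT representations, and is the predictor correlated with downstream PER.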

Files

HangJi_Thesis_Final.pdf
(pdf | 3.25 MB)
License info not available