Error-Informed Contrastive Learning for Dutch Personalized Dysarthric Phoneme Recognition
B. Koc (TU Delft - Electrical Engineering, Mathematics and Computer Science)
O.E. Scharenborg – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Y. Zhang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.R.M.M. Oertel Genannt Bierbach – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Automatic speech recognition systems achieve near-human performance under standard conditions but perform poorly on dysarthric speech due to high acoustic variability resulting from neuromotor impairment. While speaker-specific adaptation can improve performance, limited training data restricts conventional learning approaches. Contrastive learning offers a promising alternative by encouraging more discriminative phoneme representations from limited data, but its effectiveness depends strongly on how negative examples are selected. This thesis investigates whether personalized contrastive learning can improve Dutch dysarthric phoneme recognition.
A Whisper-based encoder-DNN-CTC model is extended with a triplet-loss objective to improve phoneme-level discrimination. Four negative sampling strategies are compared: randomly selected, phonologically motivated, and two empirically derived from the model's own prediction errors, one estimated on the training set and one via cross-validation. Each is evaluated under two training regimes: contrastive fine-tuning of a pretrained model and training from scratch.
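The triplet-loss extension can be illustrated with a minimal sketch. The abstract does not state how the CTC and triplet terms are combined, so the additive form, the `weight` and `margin` values, the Euclidean distance, and all function names below are assumptions for illustration only. The CTC loss is taken as a precomputed number rather than derived from model outputs.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer to the anchor
    than the negative by at least `margin` (margin value is an assumption)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(ctc_loss, triplets, weight=0.5, margin=1.0):
    """Hypothetical joint objective: CTC loss plus a weighted mean triplet loss.
    `triplets` is a list of (anchor, positive, negative) embedding tuples."""
    if not triplets:
        return ctc_loss
    mean_triplet = sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)
    return ctc_loss + weight * mean_triplet


# Toy 2-D phoneme embeddings (illustrative values, not from the thesis).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]          # same phoneme as the anchor
easy_negative = [3.0, 4.0]     # far away: contributes zero loss
hard_negative = [1.0, 0.0]     # as close as the positive: contributes loss

triplets = [
    (anchor, positive, easy_negative),
    (anchor, positive, hard_negative),
]
total = combined_loss(ctc_loss=2.0, triplets=triplets)
```

In this toy case only the hard negative contributes to the loss, which is why the choice of negatives matters: negatives that are already far from the anchor provide no training signal.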
All contrastive approaches significantly outperform a CTC-only baseline. The strongest results are obtained with phonologically motivated and cross-validation-based empirical negatives when training from scratch, yielding up to a 10.7% relative reduction in phoneme error rate. Under fine-tuning, differences between sampling strategies are negligible. When training from scratch, however, the phonological and cross-validation-based empirical strategies significantly outperform both randomly selected and training-set-based empirical negatives.
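For readers unfamiliar with the metric, the following sketch shows one common way to compute a phoneme error rate (edit distance divided by reference length) and a relative reduction between two systems. The function names and the example values are hypothetical and are not taken from the thesis; the thesis may compute the metric with different tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance over all utterances divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Illustrative phoneme sequences (made up for this example).
refs = [["k", "a", "t"], ["h", "u", "s"]]
hyps = [["k", "o", "t"], ["h", "u", "s"]]
per = phoneme_error_rate(refs, hyps)   # one substitution over six phonemes

# Illustrative error rates, NOT results from the thesis.
reduction = relative_reduction(baseline=0.40, improved=0.30)
```

A relative reduction is expressed as a fraction of the baseline error, so it should not be read as an absolute drop in percentage points.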
These findings suggest that, for this speaker, contrastive learning for dysarthric speech benefits from phonologically informed or cross-validated, empirically derived negatives rather than random selection. A practical trade-off emerges between the two strongest strategies. Phonologically motivated sampling requires no speaker-specific preprocessing and is immediately applicable to new speakers, but it generates a large number of triplets and is computationally expensive at training time. Cross-validation-based empirical sampling requires building a speaker-specific confusion matrix upfront, but it produces fewer, more targeted triplets and trains more efficiently. Given comparable performance, the choice between them reduces to whether preprocessing overhead or training-time resources are the limiting constraint.
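The confusion-matrix idea behind empirical negative sampling can be sketched as follows. This is a simplified illustration under stated assumptions: the reference and predicted phonemes are assumed to be already aligned into pairs, the function names and the top-`k` selection rule are hypothetical, and the thesis may select negatives differently. In a cross-validated variant, the aligned pairs would come from predictions on held-out folds rather than on the data the model was trained on.

```python
from collections import Counter, defaultdict


def build_confusions(aligned_pairs):
    """Count how often each reference phoneme is misrecognized as another.
    `aligned_pairs` is an iterable of (reference, predicted) phoneme pairs;
    producing this alignment is assumed to happen elsewhere."""
    confusions = defaultdict(Counter)
    for ref, hyp in aligned_pairs:
        if ref != hyp:  # correct predictions carry no confusion information
            confusions[ref][hyp] += 1
    return confusions


def top_negatives(confusions, anchor, k=2):
    """Return up to k phonemes most often confused with `anchor`,
    to be used as negative classes for that anchor."""
    return [phoneme for phoneme, _ in confusions.get(anchor, Counter()).most_common(k)]


# Illustrative aligned pairs (made up; not data from the thesis).
pairs = [
    ("p", "b"), ("p", "b"), ("p", "t"), ("p", "p"),
    ("s", "z"),
]
confusions = build_confusions(pairs)
negatives_for_p = top_negatives(confusions, "p", k=2)
```

Because only phonemes that the model actually confuses are selected, this kind of sampling yields fewer triplets per anchor than enumerating every phonologically related phoneme, at the cost of first running the model to collect its errors.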