How Does OpenAI’s Whisper Interpret Dysarthric Speech?

An Analysis of Acoustic Feature Probing and Representation Layers for Dysarthric Speech


Abstract

This paper investigates how OpenAI’s Whisper model processes dysarthric speech by probing its internal acoustic feature representations. Using the TORGO database, we analyzed Whisper’s ability to encode acoustic features characteristic of dysarthric speech across its encoder layers. Our findings reveal that early layers capture these distinctive features most effectively, while deeper layers produce more generalized representations. Despite this, Whisper achieves noteworthy zero-shot performance in distinguishing dysarthria severity levels. We employed a series of probing tasks tailored to dysarthric speech characteristics to pinpoint specific features and trace how they are transformed across the model’s layers. This study highlights Whisper’s potential for handling atypical speech patterns without fine-tuning, paving the way for further research into the interpretability and application of transformer-based models in medical and assistive technologies. We discuss the implications of these results for enhancing transparency, reliability, and safe AI integration in healthcare.
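To illustrate the kind of layer-wise probing the abstract describes, the sketch below extracts per-layer encoder hidden states from Whisper via the Hugging Face transformers API and fits a separate linear probe on each layer. The whisper-base checkpoint, mean-pooling over time, and the logistic-regression probe are assumptions made for illustration, not the paper's exact configuration; the TORGO clips and severity labels are assumed to be loaded and resampled to 16 kHz beforehand.

```python
# A minimal sketch of layer-wise severity probing, assuming the whisper-base
# checkpoint and a logistic-regression probe; the paper's exact probing tasks,
# feature set, and model size are not specified here.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import WhisperModel, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def layer_embeddings(waveform: np.ndarray, sr: int = 16000) -> list[np.ndarray]:
    """Return one mean-pooled embedding per encoder layer (plus the input embedding)."""
    feats = processor(waveform, sampling_rate=sr, return_tensors="pt").input_features
    with torch.no_grad():
        out = model.encoder(feats, output_hidden_states=True)
    # out.hidden_states is a tuple of (1, frames, dim) tensors, one per layer.
    return [h.mean(dim=1).squeeze(0).numpy() for h in out.hidden_states]

def probe_each_layer(clips):
    """clips: (waveform, severity_label) pairs -- assumed 16 kHz mono TORGO audio."""
    per_layer, labels = None, []
    for wav, severity in clips:
        embs = layer_embeddings(wav)
        if per_layer is None:
            per_layer = [[] for _ in embs]
        for i, emb in enumerate(embs):
            per_layer[i].append(emb)
        labels.append(severity)
    # Fit one probe per layer; higher accuracy means severity is more
    # linearly decodable from that layer's representation.
    for i, X in enumerate(per_layer):
        probe = LogisticRegression(max_iter=1000)
        acc = cross_val_score(probe, np.array(X), labels, cv=5).mean()
        print(f"layer {i:2d}: severity probe accuracy = {acc:.3f}")
```

Comparing probe accuracies across layers is what lets one localize where dysarthria-relevant information is most linearly accessible; under this setup, a decline in deeper layers would be consistent with the more generalized representations the abstract reports.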