Visualizing Phoneme Category Adaptation in Deep Neural Networks

Conference Paper (2018)

Author(s)

O.E. Scharenborg (TU Delft - Multimedia Computing, Radboud Universiteit Nijmegen)

Sebastian Tiesmeyer (Radboud Universiteit Nijmegen)

Mark Hasegawa-Johnson (University of Illinois at Urbana Champaign)

Najim Dehak (Johns Hopkins University)

Multimedia Computing

Copyright

DOI related publication

https://doi.org/10.21437/Interspeech.2018-1707

Keywords

Visualisation; Deep neural networks; Phoneme category adaptation; Human perceptual learning

To reference this document use:

https://resolver.tudelft.nl/uuid:33a28b86-c5dc-4d14-bfae-08d3dcf2cdbf

Publication Year

2018

Language

English

Pages (from-to)

1482-1486

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Both human listeners and machines need to adapt their sound categories whenever they encounter a new speaker. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: (1) to investigate whether a deep neural network (DNN)-based ASR system can adapt to only a few examples of ambiguous speech, as humans have been found to do; and (2) to investigate a DNN’s ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level: we visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers. This suggests that human perceptual learning might involve a similar nonlinear distortion of a perceptual space that is intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus both provide valuable insights into the way humans process speech and help improve ASR technology.
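As a rough illustration of what visualizing hidden-layer activations can involve, the sketch below passes a batch of input frames through a small feed-forward network, records the output of each hidden layer, and projects each layer's activations to two dimensions with PCA. This is not the authors' implementation: the network sizes, the ReLU nonlinearity, the random weights and inputs, the use of PCA, and the function names `hidden_activations` and `pca_2d` are all assumptions made for illustration. The abstract does not specify the architecture, the input features, or the visualization technique.

```python
import numpy as np


def hidden_activations(x, weights, biases):
    """Return the activations of every hidden layer for inputs x.

    x has shape (n_samples, n_features). ReLU is an assumed nonlinearity.
    """
    activations = []
    h = x
    for w, b in zip(weights, biases):
        h = np.maximum(0.0, h @ w + b)  # affine map followed by ReLU
        activations.append(h)
    return activations


def pca_2d(a):
    """Project the rows of a onto their first two principal components."""
    centered = a - a.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical data: 50 frames of 13-dimensional acoustic features.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 13))

# Hypothetical network with two hidden layers of 32 units each.
sizes = [13, 32, 32]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

acts = hidden_activations(x, weights, biases)
points = [pca_2d(a) for a in acts]  # one (50, 2) array per hidden layer
```

To compare a network before and after adaptation under this sketch, one option would be to run the same inputs through both sets of weights and scatter-plot the resulting 2-D points per layer, colored by phone label. Fitting the projection once and reusing it for both conditions would keep the two plots in the same coordinate system.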

Files

Visualization_final_camera_rea... (pdf)

(pdf | 0.42 Mb)

License info not available