Building an ASR System for Mboshi Using A Cross-language Definition of Acoustic Units Approach

Conference Paper (2018)

Author(s)

O.E. Scharenborg (TU Delft - Multimedia Computing, Radboud Universiteit Nijmegen)

Patrick Ebel (Radboud Universiteit Nijmegen)

Francesco Ciannella (Carnegie Mellon University)

Mark Hasegawa-Johnson (University of Illinois at Urbana Champaign)

Najim Dehak (Johns Hopkins University)

Research Group

Multimedia Computing

DOI related publication

https://doi.org/10.21437/SLTU.2018-35

Low-resource automatic speech recognition, Cross-language adaptation, Semi-supervised training

To reference this document use:

https://resolver.tudelft.nl/uuid:fece6820-ab82-48d8-9cc2-e010a0f645a5

Publication Year

2018

Language

English

Pages (from-to)

167-171

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

For many languages in the world, not enough (annotated) speech data is available to train an ASR system. Recently, we proposed a cross-language method for training an ASR system using linguistic knowledge and semi-supervised training. Here, we apply this approach to the low-resource language Mboshi. Starting from an ASR system trained on Dutch, Mboshi acoustic units were first created through cross-language initialization of the phoneme vectors in the output layer. Subsequently, this adapted system was retrained using Mboshi self-labels. Two training methods were investigated: retraining only the output layer and retraining the full deep neural network (DNN). The resulting Mboshi system was analyzed by investigating per-phoneme accuracies and phoneme confusions, and by visualizing the hidden layers of the DNNs before and after retraining with the self-labels. Results showed fairly similar performance for the two training methods but a better phoneme representation for the fully retrained DNN.
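
The two steps named in the abstract, cross-language initialization of output-layer phoneme vectors followed by retraining on self-labels, can be pictured with the toy sketch below. It is not the authors' implementation. It assumes a softmax output layer with one weight vector per phoneme, assumes that a target phoneme without a direct source counterpart is initialized as a weighted combination of source-language vectors (the `mapping` weights and all function names are hypothetical), and assumes that a self-label is simply the initialized model's most probable phoneme for a frame. Only the output-layer-only retraining variant is shown.

```python
import math


def init_target_vectors(source_vectors, mapping):
    """Build target-language output vectors from source-language ones.

    mapping: target phoneme -> {source phoneme: weight} (hypothetical weights).
    """
    dim = len(next(iter(source_vectors.values())))
    target = {}
    for tgt, combo in mapping.items():
        vec = [0.0] * dim
        for src, weight in combo.items():
            for i, x in enumerate(source_vectors[src]):
                vec[i] += weight * x
        target[tgt] = vec
    return target


def posteriors(hidden, vectors, phonemes):
    """Softmax over dot products of a hidden activation with each phoneme vector."""
    scores = [sum(h * w for h, w in zip(hidden, vectors[p])) for p in phonemes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_label(hidden, vectors, phonemes):
    """Return the phoneme the current model finds most probable for this frame."""
    probs = posteriors(hidden, vectors, phonemes)
    return phonemes[max(range(len(phonemes)), key=probs.__getitem__)]


def retrain_output_layer(frames, labels, vectors, phonemes, lr=0.1, epochs=5):
    """SGD on cross-entropy against fixed self-labels; hidden layers stay frozen."""
    new = {p: list(v) for p, v in vectors.items()}  # leave the input untouched
    for _ in range(epochs):
        for hidden, label in zip(frames, labels):
            probs = posteriors(hidden, new, phonemes)
            for p, prob in zip(phonemes, probs):
                grad = prob - (1.0 if p == label else 0.0)
                for i, h in enumerate(hidden):
                    new[p][i] -= lr * grad * h
    return new


# Toy data: two source phoneme vectors and two hidden-layer activations.
source = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
mapping = {"a": {"a": 1.0}, "e": {"a": 0.5, "i": 0.5}}  # "e" has no source vector
phonemes = ["a", "e"]
frames = [[2.0, 0.0], [0.0, 2.0]]

initial = init_target_vectors(source, mapping)
labels = [self_label(f, initial, phonemes) for f in frames]
retrained = retrain_output_layer(frames, labels, initial, phonemes)
```

In this toy setting, retraining can only reinforce the decisions the initialized model already makes, which is why the quality of the cross-language initialization matters. Retraining the full DNN, the second method in the abstract, would additionally update the layers that produce `hidden`.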

Files

Odette.pdf

(pdf | 0.579 Mb)

License info not available