Sparse Temporal Convolutional Neural Networks for Keyword Spotting

Master Thesis (2024)
Author(s)

P. Fu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C. Gao – Mentor (TU Delft - Electronics)

Sijun Du – Graduation committee member (TU Delft - Electronic Instrumentation)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2024 Peng Fu
Publication Year
2024
Language
English
Graduation Date
08-01-2024
Awarding Institution
Delft University of Technology
Programme
Electrical Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Keyword spotting (KWS) is an essential component of voice recognition services on smart devices. Its always-on operation demands high accuracy and real-time response, and low power consumption is another key requirement for KWS devices. In previous research, neural networks have become popular for KWS tasks because of their accuracy compared with traditional machine learning techniques. Among classical neural networks such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), temporal convolutional networks (TCNs) have recently begun to attract attention. Moreover, sparsity has proven an efficient way to address the growing model size of modern neural network designs. In this work, a TCN model is trained for KWS on the Google Speech Commands V2 dataset and achieves an accuracy of 94.1%. Building on that model, two types of sparsity are applied. The first is temporal sparsity: by introducing a Delta convolution layer, the Delta temporal convolutional network (DeltaTCN) achieves an accuracy of 93.6% with a 72% reduction in floating-point operations (FLOPs) compared to the original TCN. The second is structural weight sparsity: by sparsifying the weight matrix of each convolution layer, the structurally sparse temporal convolutional network (SSPTCN) achieves 93.6% accuracy with a 70% reduction in FLOPs and a 39% reduction in parameters.
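The temporal sparsity idea behind the Delta convolution layer can be illustrated with a minimal sketch of delta encoding: a feature value is propagated to the next layer only when it has changed by more than a threshold since the last propagated value, so the downstream convolution can skip the multiply-accumulates for the zeroed entries. This is an illustrative sketch, not the thesis's implementation; the function name `delta_encode` and the `threshold` value are assumptions chosen for the example.

```python
import numpy as np

def delta_encode(frames, threshold=0.05):
    """Zero out per-channel changes smaller than `threshold`.

    frames: array of shape (time_steps, channels).
    Returns an array of the same shape holding only the significant
    deltas; small fluctuations are suppressed, creating temporal sparsity.
    """
    ref = np.zeros(frames.shape[1])        # last propagated value per channel
    deltas = np.zeros_like(frames)
    for t, x in enumerate(frames):
        change = x - ref
        mask = np.abs(change) > threshold  # keep only significant changes
        deltas[t, mask] = change[mask]
        ref[mask] = x[mask]                # update reference where propagated
    return deltas

# Toy input: 4 time steps, 3 channels; slowly varying channels produce zeros.
frames = np.array([
    [0.00, 0.00, 0.0],
    [0.10, 0.01, 0.0],
    [0.12, 0.02, 0.5],
    [0.12, 0.03, 0.5],
])
d = delta_encode(frames, threshold=0.05)
sparsity = 1.0 - np.count_nonzero(d) / d.size  # fraction of skippable inputs
```

Because the zeroed entries contribute nothing to a convolution, the fraction `sparsity` is a rough proxy for the FLOP reduction a Delta layer can exploit; the cumulative sum of the deltas reconstructs the input to within the threshold.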
