Sparse Temporal Convolutional Neural Networks for Keyword Spotting

Abstract

Keyword spotting (KWS) is an essential component of voice-recognition services on smart devices. Its always-on nature demands both high accuracy and real-time response, and low power consumption is another key requirement for KWS devices. In prior research, neural networks have become popular for KWS tasks because of their higher accuracy compared with traditional machine-learning techniques. Among classical neural-network architectures such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), temporal convolutional networks (TCNs) have recently begun to attract attention. In addition, exploiting sparsity is an effective way to address the growing model size of modern neural-network designs. In this work, a TCN model is trained for KWS on the Google Speech Commands V2 dataset and achieves an accuracy of 94.1%. On top of that model, two forms of sparsity are applied. The first is temporal sparsity: by introducing a Delta convolution layer, the Delta temporal convolutional network (DeltaTCN) achieves 93.6% accuracy with a 72% reduction in floating-point operations (FLOPs) compared with the original TCN model. The second is structural weight sparsity: by imposing sparsity on the weight matrix of each convolution layer, the structurally sparse temporal convolutional network (SSPTCN) achieves 93.6% accuracy with a 70% reduction in FLOPs and a 39% reduction in parameters.
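The temporal (delta) sparsity mentioned above rests on the observation that consecutive audio feature frames change little, so a layer need only process changes that exceed a threshold. The following is a minimal sketch of that delta-update idea in NumPy; the function name, threshold value, and array layout are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def delta_threshold(frames, theta=0.1):
    """Delta (temporal) sparsity sketch: only per-channel changes whose
    magnitude exceeds theta are propagated; smaller changes become zeros,
    so a downstream convolution could skip those multiplications.
    `frames` has shape (T, C): T time steps, C feature channels.
    Hypothetical illustration, not the thesis's Delta convolution layer."""
    ref = frames[0].astype(float).copy()   # last transmitted value per channel
    deltas = np.zeros_like(frames, dtype=float)
    deltas[0] = frames[0]                  # first frame is sent in full
    for t in range(1, len(frames)):
        change = frames[t] - ref
        mask = np.abs(change) >= theta     # significant changes only
        deltas[t][mask] = change[mask]
        ref[mask] = frames[t][mask]        # update reference where a delta was sent
    return deltas
```

The fraction of zeros in the returned delta stream gives a rough proxy for the FLOP savings a delta layer can realize; summing the deltas over time reconstructs each frame to within the threshold `theta`.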
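The structural weight sparsity can likewise be pictured as removing whole groups of a convolution's weights rather than individual entries, which is what allows both FLOP and parameter counts to drop. Below is a hedged sketch of one common structured scheme, magnitude-based output-channel pruning; the function name, keep ratio, and norm criterion are assumptions for illustration and need not match the thesis's scheme.

```python
import numpy as np

def prune_channels(weight, keep_ratio=0.6):
    """Structured weight sparsity sketch: zero out entire output channels
    of a 1-D conv weight (shape: out_channels, in_channels, kernel_size)
    with the smallest L2 norms, keeping `keep_ratio` of the channels.
    Hypothetical illustration, not the thesis's pruning method."""
    norms = np.linalg.norm(weight.reshape(weight.shape[0], -1), axis=1)
    n_keep = max(1, int(round(keep_ratio * weight.shape[0])))
    keep = np.argsort(norms)[-n_keep:]     # indices of the strongest channels
    pruned = np.zeros_like(weight)
    pruned[keep] = weight[keep]            # surviving channels copied intact
    return pruned
```

Because entire channels are zeroed, the pruned rows can be removed from the stored weight tensor and skipped at inference time, unlike unstructured sparsity, which leaves scattered zeros that are harder to exploit on real hardware.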