Deep Learning-Based Sound Identification

Environmental sound identification and recognition aim to detect sound events within an audio clip. This technology is useful in many real-world applications, such as security systems, smart vehicle navigation, and monitoring of noise pollution. Research on this topic has received increasing attention in recent years, and performance has improved rapidly thanks to deep learning methods. In this project, our goal is to perform urban sound classification with several neural network models. We select the log-Mel spectrogram as the audio representation and use two types of neural networks for the classification task. The first is the convolutional neural network (CNN), the most straightforward and widely used method for classification problems. The second type comprises autoencoder-based models: the variational autoencoder (VAE), the beta-VAE, and the bounded information rate variational autoencoder (BIR-VAE). The encoders of these systems extract a low-dimensional representation, and classification is then performed on this so-called latent representation. Our experiments assess the performance of the different models using standard evaluation metrics. The results show that the CNN is the most promising classifier in our case; the autoencoder-based models can successfully reconstruct the log-Mel spectrogram, and the latent features learned by their encoders are meaningful, since classification can be achieved on them.
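The log-Mel spectrogram used as the audio representation can be sketched from first principles as follows. This is a minimal illustration with NumPy, not the project's actual feature-extraction code; in practice a library such as librosa is typically used, and the parameter values here (n_fft=1024, hop=512, 64 mel bands, 22050 Hz sample rate) are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(y, sr=22050, n_fft=1024, hop=512, n_mels=64):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # project onto the mel filterbank, then log-compress.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)

# Example on a synthetic one-second 440 Hz tone (illustrative input).
sr = 22050
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = np.sin(2.0 * np.pi * 440.0 * t)
S = log_mel_spectrogram(y, sr=sr)
print(S.shape)  # (time frames, mel bands)
```

The resulting time-by-frequency matrix is what the CNN and the autoencoder-based models take as input; the log compression roughly matches the ear's loudness perception and keeps the dynamic range manageable for training.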