Music Genre Detection with Neural Networks

Abstract

In this thesis we use neural networks to classify music samples by genre. We divide this task into four parts. In the first part, we prepare the audio files to be used as input to a neural network. Specifically, we examine ways to create spectrograms. We then optimise the spectrograms by reducing and normalising them. The second part consists of theoretical information regarding neural networks. We initially look at the perceptron, the building block of any neural network, and then extend this notion to various networks, such as the multilayer perceptron, the recurrent network and the convolutional network. In the third part, we apply the theoretical knowledge that we gained in part two and implement a standard neural network, a recurrent neural network, a convolutional neural network, and a combination of recurrent and convolutional neural networks. We examine various network structures and evaluate them based on what the networks can learn, how fast they can learn it and how accurate their classifications are. At the same time, we focus on creating an efficient network, one that uses as few computational resources as possible. We train each of the networks with the data that we created in part one, and compare the performance of the networks with each other. In terms of accuracy and loss measures, we find that the best-performing network is the combination of the recurrent and convolutional neural network. This network is able to determine which of six considered genres a 3-second sound sample belongs to with an accuracy of 90%. However, in terms of the computational resources required to train the models, the convolutional neural network with many kernels converges at the lowest computational cost. We then experiment with the number of kernels in the convolutional layers, and find that a layer with many kernels learns faster, but does not necessarily yield better results. This is because networks with fewer kernels eventually learn the same significant kernels. Finally, we consider the impact of varying sound sample lengths on the performance of the networks. For a 1-second sound sample, we see that the recurrent network outperforms the other networks in terms of the accuracy of its predictions. However, the longer we make the sound samples, within the range of 1 to 3 seconds, the better each network performs. Part four consists of an explanation of each component in the system, and presents a complete system built with Google’s TensorFlow [15], in which all the components work together to create an end-to-end classification system.
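
To illustrate the preparation step described for part one, the following is a minimal sketch of how a spectrogram could be created, reduced and normalised. It is not the thesis implementation. The function names, the frame length, the hop size, the number of frequency bands, the log compression and the synthetic 440 Hz test tone are all assumptions chosen for illustration, and NumPy stands in for whatever audio tooling the thesis actually uses.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a short-time Fourier transform.

    frame_len and hop are illustrative values, not taken from the thesis.
    Returns an array of shape (frame_len // 2 + 1, n_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T


def reduce_and_normalise(spec, n_bands=32):
    """Average adjacent frequency bins into n_bands, log-compress, scale to [0, 1].

    This is one possible reading of "reducing and normalising" (an assumption).
    """
    usable = (spec.shape[0] // n_bands) * n_bands  # drop bins that do not fill a band
    reduced = spec[:usable].reshape(n_bands, -1, spec.shape[1]).mean(axis=1)
    log_spec = np.log1p(reduced)
    span = log_spec.max() - log_spec.min()
    if span == 0:
        return np.zeros_like(log_spec)
    return (log_spec - log_spec.min()) / span


# Hypothetical input: a 3-second, 440 Hz sine tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(3 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

features = reduce_and_normalise(spectrogram(tone))
print(features.shape)  # (frequency bands, time frames)
```

Under these assumptions, the resulting array has one row per frequency band and one column per time frame, with values between 0 and 1, which is a convenient input shape for both convolutional and recurrent layers.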