More than a feeling?
Reliability and robustness of high-level music classifiers
Abstract
High-level music classification tasks such as automatic music mood annotation pose several challenges, both from a psychological and a machine learning point of view. Ground truth labels for such tasks are hard to define because of the abstract and aesthetic nature of the data, which depends largely on human psychology and perception. Such labels, however, are required in training and validation sets when traditional machine learning methods are used. Furthermore, copyright restrictions that prevent the sharing of commercial music audio force these classifiers to work with pre-computed audio features, which are known to be somewhat unstable. Given these challenges, two questions arise: do high-level music classifiers actually perform as well as we believe, and can we trust their output as a solid foundation for future research? This work analyzes the performance of high-level music classifiers on a dataset combining data from AcousticBrainz and Spotify, using metrics based on label stability, label agreement and distributional differences, none of which depend on problematic ground truth labels. Unexpected patterns in the classifier outputs are uncovered, indicating that these outputs should not be taken as absolute truth and do not form a solid foundation for further research. Improving these high-level music classifiers is a multidisciplinary effort that requires better evaluation methods. To this end, several approaches for more comprehensive classifier testing are presented, based on best practices in psychology and software testing. These approaches are not limited to the field of Music Information Retrieval and can be applied to evaluate classifiers in other domains as well.