More than a feeling?

Reliability and robustness of high-level music classifiers

Master Thesis (2020)
Author(s)

C. Mostert (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

C.C.S. Liem – Mentor (TU Delft - Multimedia Computing)

A. Hanjalic – Graduation committee member (TU Delft - Intelligent Systems)

A. Panichella – Graduation committee member (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2020
Language
English
Graduation Date
20-08-2020
Awarding Institution
Delft University of Technology
Programme
Computer Science | Data Science and Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

High-level music classification tasks, such as automatic music mood annotation, pose several challenges from both a psychological and a machine learning point of view. Ground-truth labels for these tasks are hard to define due to the abstract and aesthetic nature of the data, which depends heavily on human psychology and perception. Such labels, however, are required in training and validation sets when traditional machine learning methods are used. Furthermore, because copyright restrictions prevent the sharing of commercial music audio, such classifiers have to work with pre-computed music audio features, which are known to be somewhat unstable. These inherent challenges raise the following questions: do high-level music classifiers actually perform as well as we believe, and can we trust their output as a solid foundation for future research? This work analyzes the performance of high-level music classifiers using metrics based on label stability, label agreement, and distributional differences, none of which depend on problematic ground-truth labels, on a dataset combining data from AcousticBrainz and Spotify. Unexpected patterns in classifier outputs are uncovered, indicating that these outputs should not be taken as absolute truth and do not form a solid foundation for further research. Improving these high-level music classifiers is a multidisciplinary effort that requires better evaluation methods. To this end, several approaches for more comprehensive classifier testing are presented, based on best practices in psychology and software testing. These approaches are not constrained to the field of Music Information Retrieval and can be applied to evaluate classifiers in other domains as well.
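The ground-truth-free metrics named in the abstract can be illustrated with a minimal sketch. The function names, mood labels, and exact metric definitions below are illustrative assumptions, not the thesis's actual implementation: label stability is taken here as the fraction of items whose predicted label survives a second run (e.g. on re-extracted features), and the distributional difference as the L1 distance between the two predicted-label distributions.

```python
from collections import Counter

def label_stability(labels_a, labels_b):
    """Fraction of items whose predicted label is unchanged between
    two classifier runs (e.g. original vs. re-extracted features)."""
    assert len(labels_a) == len(labels_b)
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)

def distribution_difference(labels_a, labels_b):
    """L1 distance between the two predicted-label distributions,
    a value in [0, 2]; 0 means identical label frequencies."""
    ca, cb = Counter(labels_a), Counter(labels_b)
    n = len(labels_a)
    labels = set(ca) | set(cb)
    return sum(abs(ca[k] / n - cb[k] / n) for k in labels)

# Hypothetical mood predictions for five tracks, before and after
# a feature-extraction change.
run1 = ["happy", "sad", "happy", "calm", "sad"]
run2 = ["happy", "sad", "calm", "calm", "happy"]
print(label_stability(run1, run2))         # 0.6
print(distribution_difference(run1, run2))  # 0.4
```

Note that both quantities compare classifier outputs against each other rather than against annotated ground truth, which is the property the abstract relies on.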
