More than a feeling?

Reliability and robustness of high-level music classifiers


Abstract

High-level music classification tasks such as automatic music mood annotation pose several challenges from both a psychological and a machine learning point of view. Ground truth labels for these tasks are hard to define because of the abstract and aesthetic nature of the data, which depends heavily on human psychology and perception. Such labels are nevertheless required for training and validation sets when traditional machine learning methods are used. Furthermore, because copyright restrictions prevent the sharing of commercial music audio, these classifiers have to work with pre-computed music audio features, which are known to be somewhat unstable. Given the challenges inherent to high-level music classification, two questions arise: do high-level music classifiers actually perform as well as we believe, and can we trust their output as a solid foundation for future research? This work analyzes the performance of high-level music classifiers on a dataset that combines data from AcousticBrainz and Spotify, using metrics based on label stability, label agreement, and distributional differences, none of which depend on problematic ground truth labels. Unexpected patterns in the classifier outputs are uncovered, indicating that these outputs should not be taken as absolute truth and do not form a solid foundation for further research. Improving these high-level music classifiers is a multidisciplinary effort that requires better evaluation methods. To this end, several approaches for more comprehensive classifier testing are presented, based on best practices in psychology and software testing. These approaches are not constrained to the field of Music Information Retrieval and can be applied to evaluate classifiers in other domains as well.
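
To illustrate the kind of ground-truth-free metrics mentioned above, the minimal sketch below computes a label agreement rate and a distributional difference (Jensen-Shannon divergence) between two sets of classifier outputs for the same tracks. The class names, example predictions, and helper functions are hypothetical illustrations, not taken from the paper itself.

```python
from collections import Counter
import math


def label_agreement(labels_a, labels_b):
    """Fraction of tracks for which two classifier runs (e.g. on duplicate
    submissions of the same recording) assign the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def label_distribution(labels, classes):
    """Relative frequency of each class label in a set of predictions."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(c, 0) / total for c in classes]


def jensen_shannon_divergence(p, q):
    """Symmetric measure of how different two label distributions are
    (0 = identical, ln 2 = completely disjoint)."""
    def kl(x, y):
        return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical mood predictions for the same five tracks from two runs
classes = ["happy", "sad", "aggressive", "relaxed"]
run_a = ["happy", "sad", "happy", "relaxed", "aggressive"]
run_b = ["happy", "sad", "relaxed", "relaxed", "happy"]

print(label_agreement(run_a, run_b))  # 0.6: the runs agree on 3 of 5 tracks
p = label_distribution(run_a, classes)
q = label_distribution(run_b, classes)
print(jensen_shannon_divergence(p, q))  # > 0: the label distributions differ
```

Neither metric requires human-annotated ground truth: both compare classifier outputs against each other, which is what makes them usable when reliable labels are hard to obtain.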