Understanding Context Effects in the Evaluation of Music Similarity


Abstract

This work analyses context effects in the evaluation of music similarity performed by human annotators, in order to better understand their impact on the current annotation protocol of the Music Information Retrieval Evaluation eXchange (MIREX). Human annotators are known to be subjective when giving similarity judgements. The Audio Music Similarity task in MIREX uses human annotators to collect similarity judgements: each annotator rates a list of candidate songs that a participating system has returned as similar. The annotation protocol has no clear guidelines, and the literature describes psychological effects that can influence similarity scores. Studies show that different annotators disagree in the Audio Music Similarity task. It is argued that this disagreement stems from the natural subjectivity of human annotators, but how much of the subjectivity is natural?

In this work, context effects are explored: the over- or underrating of candidate songs due to specific properties of the annotated list of candidates. These properties of the candidate list are called factors and are used as independent variables. The exploration of context effects is split into two parts: 1) recognising context effects and 2) measuring their impact. New similarity judgements are collected through crowdsourcing, and this data is checked for reliability before the context effects are analysed. For recognising context effects, the revisions annotators make to their previous judgements are taken as a signal that they notice potential context effects. For measuring the magnitude of the over- or underrating, the distance of a set of judgements to the ground truth is used. Hypotheses are formulated for the dependent variables change and distance, based on the factors Order, Trend, Location, Spread, and Outlier. The collected data shows signs of context effects: the Trend and Outlier hypotheses are in line with the data, while the data runs counter to the Order hypothesis. When annotators revise their judgements, the final scores are closer to the ground truth than before the revisions. However, no statistically significant results related to context effects are found throughout the work.
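The abstract does not define the two dependent variables precisely, so the following is a minimal sketch of one plausible reading: change as the total absolute revision between an annotator's initial and final scores, and distance as the mean absolute difference between a set of judgements and the ground truth. The function names, the metric choices, and the example numbers are all assumptions for illustration, not the thesis's actual definitions.

```python
import numpy as np

def change(initial_scores, final_scores):
    """Total revision an annotator made: sum of absolute differences
    between initial and final judgements (assumed definition)."""
    return np.abs(np.asarray(final_scores) - np.asarray(initial_scores)).sum()

def distance(judgements, ground_truth):
    """Distance of a set of judgements to the ground truth, taken here
    as the mean absolute difference (assumed definition)."""
    return np.abs(np.asarray(judgements) - np.asarray(ground_truth)).mean()

# Hypothetical example: an annotator revises two of five judgements.
initial = [3, 4, 2, 5, 1]
final   = [3, 3, 2, 5, 2]
truth   = [3, 3, 2, 4, 2]

print(change(initial, final))    # 2.0 -> the annotator noticed something
print(distance(initial, truth))  # 0.6
print(distance(final, truth))    # 0.2 -> revisions moved scores closer to truth
```

Under these assumed metrics, the abstract's observation reads naturally: a nonzero change indicates the annotator reacted to the list, and a drop in distance after revision means the final judgements ended up nearer the ground truth.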