Hear Me Out
A Study on the Use of the Voice Modality for Crowdsourced Relevance Assessments
Abstract
The creation of relevance assessments by human assessors (nowadays often crowdworkers) is a vital step when building IR test collections. Prior work has investigated assessor quality and behaviour, as well as tooling to support assessors in their task. However, we have few insights into the impact of a document's presentation modality on assessor efficiency and effectiveness. Given the rise of voice-based interfaces, we investigate whether it is feasible for assessors to judge the relevance of text documents via a voice-based interface. We ran a user study (n = 49) on a crowdsourcing platform in which participants judged the relevance of short and long documents, sampled from the TREC Deep Learning corpus, presented to them in either the text or the voice modality. We found that: (i) participants are equally accurate in their judgements across the text and voice modalities; (ii) with increased document length, participants take significantly longer to make relevance judgements in the voice condition (almost twice as long for documents longer than 120 words); and (iii) the ability of assessors to ignore non-relevant stimuli (i.e., inhibition) affects assessment quality in the voice modality: assessors with higher inhibition are significantly more accurate than those with lower inhibition. Our results indicate that the voice modality can be reliably leveraged to effectively collect relevance labels from crowdworkers.