Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline

Conference Paper (2024)
Author(s)

Maliha Jahan (Johns Hopkins University)

Helin Wang (Johns Hopkins University)

Thomas Thebaud (Johns Hopkins University)

Yinglun Sun (University of Illinois at Urbana-Champaign)

Giang Le (University of Illinois at Urbana-Champaign)

Zsuzsanna Fagyal (University of Illinois at Urbana-Champaign)

O.E. Scharenborg (TU Delft - Multimedia Computing)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

Laureano Moro-Velazquez (Johns Hopkins University)

Najim Dehak (Johns Hopkins University)

Research Group
Multimedia Computing
Publication Year
2024
Language
English
Pages (from-to)
7296-7306
ISBN (electronic)
9782493814104
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from irrelevant ones (filtering), and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truth, we show that it can reduce the resources required for dataset annotation while losing very little important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall: recall reached as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance.
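
The abstract describes two GPT-4 annotation passes per transcript: filtering (is the file relevant to the intended dataset?) and tagging (where in the transcript does an identification occur?). Below is a minimal sketch of how such a two-stage pipeline could be wired up against OpenAI's chat completions API. The prompts, model settings, and helper names are illustrative assumptions for exposition, not the authors' published implementation.

    # Illustrative sketch only: prompts and settings are assumptions,
    # not the paper's actual annotation setup.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    FILTER_PROMPT = (
        "You will be given a speech transcript. Answer YES if any speaker "
        "identifies themselves or another speaker as belonging to a race, "
        "ethnicity, or national origin group; otherwise answer NO."
    )

    TAG_PROMPT = (
        "You will be given a speech transcript. Quote every sentence in which "
        "a speaker identifies themselves or another speaker as belonging to a "
        "race, ethnicity, or national origin group, one sentence per line."
    )

    def annotate(transcript: str, instruction: str) -> str:
        """Run one annotation task (filtering or tagging) on one transcript."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,  # keep annotation output as deterministic as possible
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": transcript},
            ],
        )
        return response.choices[0].message.content

    # Stage 1 (filtering): discard transcripts GPT-4 marks irrelevant.
    # Stage 2 (tagging): extract identification spans from the kept files.
    transcript = "... one transcript from the source corpus ..."
    if annotate(transcript, FILTER_PROMPT).strip().upper().startswith("YES"):
        tags = annotate(transcript, TAG_PROMPT)
        print(tags)

Running the cheap yes/no filter first and tagging only the files it keeps mirrors the two-step structure described in the abstract, since the tagging prompt is the more expensive, open-ended call.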