I.N. Huisman
Comparing Human Listeners and Dutch ASR on Transcribing Child Speech
The Effect of Familiarity with Child Speech on Transcription Performance
Automatic Speech Recognition (ASR) systems are becoming increasingly common in day-to-day life, yet child speech remains challenging for them. This paper presents the first comparison of Dutch human listeners and Dutch ASR systems on Dutch child speech. It tests whether familiarity with child speech improves human transcription performance and whether the age of the child speaker influences transcription performance.
A balanced set of 40 utterances was taken from the JASMIN database (speakers aged 7-11) and transcribed by 20 human listeners: 10 familiar with child speech (parents or caretakers) and 10 unfamiliar. Transcripts were also gathered from two state-of-the-art ASR systems (Google Telephony and a Conformer model). All transcripts were evaluated against reference transcripts using Word Error Rate (WER), and differences were tested for statistical significance.
Results show that overall ASR transcription performance was comparable to human performance and in some cases slightly, but not significantly, better. Familiar listeners did not outperform unfamiliar listeners; there was no significant performance difference between the two groups. Within the 7-11 age range, no clear relationship between speaker age and WER was found, but results were sensitive to sentence-difficulty outliers and "speaker effects".
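The abstract relies on Word Error Rate (WER) as its evaluation metric. For readers unfamiliar with it, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The function name `word_error_rate`, the lowercasing and whitespace tokenization, and the example sentences are illustrative assumptions; the text normalization and tooling actually used in the study are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER of `hypothesis` against `reference`.

    Assumption: transcripts are lowercased and split on whitespace.
    The study's own normalization rules are not described in the abstract.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


# Made-up example sentences, not taken from the JASMIN data:
# one substitution ("zit" -> "zat") and one deletion ("de") over 6 words.
example = word_error_rate("de kat zit op de mat", "de kat zat op mat")
```

Because insertions count as errors, WER can exceed 1.0 when a hypothesis is much longer than its reference, which is worth keeping in mind when averaging over short utterances.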