Comparing Human Listeners and Dutch ASR on Transcribing Child Speech

The Effect of Familiarity with Child Speech on Transcription Performance

Bachelor Thesis (2026)
Author(s)

I.N. Huisman (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

O.E. Scharenborg – Mentor (TU Delft - Multimedia Computing)

B.J.W. Dudzik – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

More Info
Publication Year
2026
Language
English
Graduation Date
29-01-2026
Awarding Institution
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic Speech Recognition (ASR) systems are becoming increasingly common in day-to-day life, yet child speech remains challenging for them. This paper presents the first comparison of Dutch human listeners and Dutch ASR systems on Dutch child speech. It tests whether familiarity with child speech improves human transcription performance and examines whether the age of the child speaker influences transcription performance.

A balanced set of 40 utterances was taken from the JASMIN database (speakers aged 7-11) and transcribed by 20 human listeners: 10 familiar with child speech (parents or caretakers) and 10 unfamiliar. Transcripts were also gathered from two state-of-the-art ASR systems (Google Telephony and a Conformer model). All transcripts were evaluated against reference transcripts using Word Error Rate (WER), and the differences were tested for statistical significance.
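
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn a transcript into the reference, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not the evaluation code used in the thesis. The function name `word_error_rate`, the lowercasing and whitespace tokenisation, and the example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption (not from the thesis): text is normalised by lowercasing
    and splitting on whitespace only.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, not taken from JASMIN.
example_wer = word_error_rate("ik ga naar school", "ik naar de school")
print(f"WER: {example_wer:.2f}")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 for transcripts much longer than the reference.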

Results show that overall ASR transcription performance was comparable to human performance, and in some cases slightly, but not significantly, better. Familiar listeners did not outperform unfamiliar listeners: there was no significant performance difference between the two groups. Within the 7-11 age range, no clear relationship between speaker age and WER was found, but the results were sensitive to outliers in sentence difficulty and to "speaker effects".
