Comparing Human Listeners and Dutch ASR on Transcribing Child Speech

The Effect of Familiarity with Child Speech on Transcription Performance

Bachelor Thesis (2026)

Author(s)

I.N. Huisman (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

O.E. Scharenborg – Mentor (TU Delft - Multimedia Computing)

B.J.W. Dudzik – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Parents, Familiarity, Transcription, Child Speech, Automatic Speech Recognition systems

To reference this document use

https://resolver.tudelft.nl/uuid:25e0be9f-2db4-4e2f-9831-d6c8faaf2783

More Info

Publication Year

2026

Language

English

Graduation Date

29-01-2026

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic Speech Recognition (ASR) systems are becoming increasingly common in day-to-day life, yet child speech remains challenging for them. This paper presents the first comparison of Dutch human listeners and Dutch ASR systems on Dutch child speech. It tests whether familiarity with child speech improves human transcription performance and whether the age of the child speaker influences transcription performance.

A balanced set of 40 utterances was taken from the JASMIN database (speakers aged 7-11) and transcribed by 20 human listeners: 10 familiar with child speech (parents or caretakers) and 10 unfamiliar. Transcripts were also gathered from two state-of-the-art ASR systems (Google Telephony and a Conformer model). All transcripts were evaluated against reference transcripts using Word Error Rate (WER), and differences were tested for statistical significance.
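For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of words in the reference. The sketch below illustrates that standard definition only. The function name, the lowercase and whitespace tokenization, and the example sentences are assumptions made for illustration, and the thesis may have normalized or scored its transcripts differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: text is lowercased and split on whitespace. The
    normalization actually used in the thesis is not stated here.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, not taken from the JASMIN database.
wer_one_deletion = word_error_rate("de kat zit op de mat", "de kat zit op mat")
wer_one_substitution = word_error_rate("de kat zit", "de hond zit")
```

In the first hypothetical example one of six reference words is missing, so the WER is 1/6. Because insertions are counted too, WER can exceed 1 when the hypothesis is much longer than the reference.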

Results show that overall ASR transcription performance was comparable to human performance and in some cases slightly, but not significantly, better. Familiar listeners did not outperform unfamiliar listeners; there was no significant performance difference between the two groups. Within the 7-11 age range, no clear relationship between speaker age and WER was found, but results were sensitive to sentence-difficulty outliers and "speaker effects".

Files

Final_Paper_Ilse_Huisman.pdf

(pdf | 0.84 MB)

License info not available