Human vs AI: Recognising Teenage Speech

How good are humans at recognizing teenage speech samples compared to state-of-the-art AI-based automatic speech recognisers?

Bachelor Thesis (2026)
Author(s)

G. SINGH (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Y. Zhang – Mentor (TU Delft - Multimedia Computing)

O.E. Scharenborg – Mentor (TU Delft - Multimedia Computing)

B.J.W. Dudzik – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

More Info
Publication Year
2026
Language
English
Graduation Date
29-01-2026
Awarding Institution
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic Speech Recognition (ASR) systems have achieved remarkable performance in recent years; however, their robustness to diverse speech, such as teenage speech, remains an open area of research. This study evaluates the performance of a state-of-the-art ASR system (Google Telephony) against human listeners in transcribing native Dutch teenage speech. It also investigates whether a listener's social exposure to teenage speech influences their recognition accuracy. A listening experiment was conducted using a dataset of 40 samples of human-machine interaction (HMI) speech from native Dutch speakers aged 14 to 16, curated from the JASMIN corpus. The audio samples were transcribed by the ASR model and by a group of young adult participants (aged 20 to 24). Performance was evaluated using Word Error Rate (WER), with a specific focus on the impact of normalizing common Dutch contractions and clitics. The results show that the ASR system outperformed the average human listener, achieving a lower WER than both participant groups: those with exposure to teenage speech and those without. However, human participants with regular social exposure to teenagers performed significantly better on average than those without, confirming that familiarity with the demographic improves recognition accuracy. Furthermore, the analysis reveals that orthographic inconsistencies in the spelling of contractions significantly inflate WERs, and that normalization reduces them substantially. These findings suggest that while current ASR models are highly robust for this demographic, familiarity with teenage speech remains a relevant factor in human recognition, as shown by the lower WERs of listeners with exposure to teenage speech compared with those without.
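To illustrate how contraction normalization can affect WER, the following is a minimal Python sketch, not the evaluation code used in the thesis. WER is computed in the standard way, as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The `CONTRACTIONS` table, the function names and the example sentence are assumptions made for illustration only; the thesis may use a different mapping and a different text-cleaning procedure.

```python
import re

# Hypothetical normalization table: a few common Dutch contractions/clitics.
# Illustrative only; not the mapping used in the thesis.
CONTRACTIONS = {
    "'t": "het",
    "'k": "ik",
    "'n": "een",
    "m'n": "mijn",
    "z'n": "zijn",
}


def normalize(text, expand=True):
    """Lowercase, strip punctuation (keeping apostrophes), split into words,
    and optionally expand contractions to their full forms."""
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    text = re.sub(r"[^\w\s']", " ", text)
    tokens = text.split()
    if expand:
        tokens = [CONTRACTIONS.get(tok, tok) for tok in tokens]
    return tokens


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis, expand=True):
    """Word Error Rate: edit distance divided by reference length."""
    ref = normalize(reference, expand)
    hyp = normalize(hypothesis, expand)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


reference = "ik heb het boek gelezen"
hypothesis = "'k heb 't boek gelezen"
print(wer(reference, hypothesis, expand=False))
print(wer(reference, hypothesis, expand=True))
```

In this made-up example the two transcripts differ only in how the contractions are spelled, so the unnormalized WER counts two substitutions in five words, while the normalized WER counts none. This is the kind of orthographic inflation the abstract refers to.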

Files

Final_Research_Paper.pdf
(pdf | 0.45 Mb)
License info not available