Human vs AI: Recognising Teenage Speech

How good are humans at recognizing teenage speech samples compared to state-of-the-art AI-based automatic speech recognisers?

Bachelor Thesis (2026)

Author(s)

G. SINGH (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Y. Zhang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

O.E. Scharenborg – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

B.J.W. Dudzik – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Automatic Speech Recognition, Teenage Speech, Human Machine Comparison, Word Error Rate (WER)

To reference this document use

https://resolver.tudelft.nl/uuid:1048e6d5-2bcb-4723-8473-ab30f04acf0e

More Info

Publication Year

2026

Language

English

Graduation Date

29-01-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic Speech Recognition (ASR) systems have achieved remarkable performance in recent years; however, their robustness to diverse speech, such as teenage speech, remains an open area of research. This study evaluates the performance of a state-of-the-art ASR system (Google Telephony) against human listeners in transcribing native Dutch teenage speech. It also investigates whether a listener's social exposure to teenage speech influences their recognition accuracy. A listening experiment was conducted using a dataset of 40 samples of Human Machine Interaction (HMI) speech from native Dutch speakers aged 14 to 16, curated from the JASMIN corpus. The audio samples were transcribed by the ASR model and by a group of young adult participants (aged 20 to 24). Performance was evaluated using Word Error Rate (WER), with a specific focus on the impact of normalizing common Dutch contractions and clitics. The results show that the ASR system outperformed the average human listener, achieving a lower WER than both participant groups: those with exposure to teenage speech and those without. However, participants with regular social exposure to teenagers performed significantly better on average than those without, confirming that familiarity with the demographic improves recognition accuracy. Furthermore, the analysis reveals that orthographic inconsistencies in how contractions are written significantly inflate WERs, and that normalization reduces WERs substantially. These findings suggest that while current ASR models are highly robust for this demographic, human domain knowledge remains a relevant factor in understanding teenage speech patterns, as shown by the lower WERs of listeners with exposure to teenage speech compared with those without.
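
To illustrate why contraction normalization matters for WER, the following is a minimal sketch, not the thesis's actual evaluation code. It computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, with and without mapping contracted forms to their full forms. The `CONTRACTIONS` table, the tokenizer, and the function names are assumptions made for illustration; the abstract does not specify which normalization rules were used.

```python
import re

# Hypothetical mapping of Dutch contractions/clitics to full forms.
# The thesis does not list its rules; these entries are illustrative only.
CONTRACTIONS = {
    "'t": "het",
    "'n": "een",
    "'k": "ik",
    "m'n": "mijn",
    "z'n": "zijn",
}


def tokenize(text):
    """Lowercase and split into words, keeping ASCII apostrophes."""
    return re.findall(r"[\w']+", text.lower())


def normalize(tokens):
    """Replace each known contraction with its full form."""
    return [CONTRACTIONS.get(tok, tok) for tok in tokens]


def wer(reference, hypothesis, normalize_contractions=False):
    """Word Error Rate: word-level edit distance / reference length."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if normalize_contractions:
        ref, hyp = normalize(ref), normalize(hyp)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "ik heb het gezien"
hypothesis = "ik heb 't gezien"
print(wer(reference, hypothesis))                               # 0.25
print(wer(reference, hypothesis, normalize_contractions=True))  # 0.0
```

In this example the transcriber wrote "'t" where the reference has "het". Without normalization that spelling difference counts as one substitution in four words; with normalization the two transcripts are identical. This sketch only handles the ASCII apostrophe, so typographic apostrophes would need to be mapped first.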

Files

Final_Research_Paper.pdf

(pdf | 0.45 Mb)

License info not available