Educational Content on YouTube: The Case of Data Systems
X. Ling (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Sole Pera – Mentor (TU Delft - Web Information Systems)
E.A. Aivaloglou – Mentor (TU Delft - Web Information Systems)
A. Katsifodimos – Graduation committee member (TU Delft - Data-Intensive Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
The advancement of data systems demands continuous learning, yet traditional educational materials often fall short of meeting evolving learning needs. YouTube has emerged as a widely used platform for informal learning, but its role in data systems education remains underexamined. This thesis addresses that gap by constructing a curated dataset of 17,434 instructional YouTube videos related to data systems and using it to examine content availability and organization, key video characteristics, factors influencing audience engagement, and subtopic coverage in SQL education. Built with a curriculum-aligned query strategy and a machine learning filtering pipeline, the dataset maintains relevance to educational objectives and offers broad topical coverage within the data systems domain. Video characteristics are analyzed across dimensions such as content volume, engagement metrics, transcript availability, language, geographic origin, and topic distribution. Findings reveal that content and engagement are highly uneven, with a small subset of videos, channels, languages, countries, and topics capturing disproportionate attention. Statistical modeling shows that engagement in this domain is positively associated with longer video duration, SQL focus, and high-subscriber channels, while overly long titles, frequent uploads, and older channels are negatively associated with it. Subtle patterns suggest that culturally or regionally tailored content may further enhance engagement. SQL-related topics dominate in both volume and engagement. However, a subtopic classification of 4,242 SQL videos reveals that, although 87% of textbook-derived subtopics are covered, content is heavily concentrated on core querying and schema commands. Advanced, theoretical, systems-level, and integration-related topics are rarely addressed.