Comparison of Linguistic Language Classification based on Origin and Data Driven Language Classification using the IPA and Clustering
I.G.M. Rethans (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Language similarity is useful for data enrichment in both Natural Language Processing (NLP) and Automatic Speech Recognition (ASR). A clustering algorithm could provide an efficient, data-driven means of defining language similarity. This research investigates the relation between linguistic classification by origin and data-driven classification based on the pronunciation of languages, using k-means clustering with a focus on the Indo-European languages. The results show large variation in the clustering outcomes and, consequently, large variation in their correspondence with the linguistic classification. This variation is caused by a relatively even spread of the data over the feature space. Still, the results indicate a significant relation between the two classification methods. Furthermore, this research serves as a foundation and a source of inspiration for future research.
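The comparison described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes each language is represented as a binary vector marking which IPA symbols occur in it, and it measures agreement between the k-means clusters and the origin-based families with the Rand index. The feature representation, the agreement measure, the language and family names, and all data values are hypothetical placeholders chosen for illustration.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=100, seed=0):
    """Plain k-means (Lloyd's algorithm); returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


def rand_index(a, b):
    """Fraction of item pairs on which two partitions agree (1.0 = identical)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical toy data: rows are languages, columns mark presence (1) or
# absence (0) of six made-up IPA symbols. These are NOT real inventories.
languages = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e", "lang_f"]
families = ["family_1", "family_1", "family_1", "family_2", "family_2", "family_2"]
features = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]

labels = kmeans(features, k=2)
score = rand_index(labels, families)

for name, family, cluster in zip(languages, families, labels):
    print(f"{name}: family={family}, cluster={cluster}")
print(f"Rand index between clusters and families: {score:.2f}")
```

The toy vectors are deliberately well separated, so the clusters match the families. With real data spread more evenly over the feature space, as the abstract reports, different initial centroids can yield different clusterings, which is one reason to repeat the run over several seeds and compare the resulting scores.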