Cold start is coming: How to approximate the optimal set of initial prototypes for clustering sequence data online
S. Fucarev (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Nadeem – Mentor (TU Delft - Cyber Security)
Sicco Verwer – Mentor (TU Delft - Cyber Security)
Gosia Migut – Graduation committee member (TU Delft - Computer Science & Engineering-Teaching Team)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Clustering is a classic topic in both academia and industry, and it is among the most popular unsupervised classification techniques. It is fast and flexible: it can accommodate any kind of data for which a suitable similarity metric can be found. SeqClu is an online, prototype-based k-medoids clustering algorithm designed to handle large quantities of sequence data. Our main focus is the role that initialization plays in the performance of SeqClu. In this paper we show that Greedy Heuristics perform significantly better than K-medoids heuristics. We also show that Greedy Heuristics can be combined to achieve potentially better accuracy, provided that a proper metric is chosen for selecting among the initialization results.
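To make the idea of greedily choosing initial prototypes concrete, the following is a minimal sketch and not the heuristics evaluated in the thesis. It uses farthest-first traversal as a stand-in greedy rule and plain edit distance as the sequence dissimilarity; both choices, as well as the names `edit_distance` and `greedy_init`, are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (illustrative metric only)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def greedy_init(
    seqs: Sequence[Sequence],
    k: int,
    dist: Callable[[Sequence, Sequence], int] = edit_distance,
) -> List[int]:
    """Return indices of k initial prototypes chosen by farthest-first traversal.

    Hypothetical helper: each step greedily adds the sequence that lies
    farthest from the prototypes chosen so far.
    """
    if not 0 < k <= len(seqs):
        raise ValueError("k must be between 1 and the number of sequences")
    chosen = [0]  # assumption: the first sequence seeds the prototype set
    # Distance from every sequence to its nearest chosen prototype.
    nearest = [dist(s, seqs[0]) for s in seqs]
    while len(chosen) < k:
        nxt = max(range(len(seqs)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(d, dist(s, seqs[nxt])) for d, s in zip(nearest, seqs)]
    return chosen
```

For example, `greedy_init(["aaaa", "aaab", "zzzz", "zzzy", "mmmm"], 3)` selects one sequence from each of the three visibly distinct groups. A rule like this spreads prototypes across the data but is sensitive to outliers, which is one reason the choice of initialization heuristic is worth evaluating.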