ELM Decision Branching

Improving streaming clustering with decision branching

More Info
expand_more

Abstract

Clustering is a commonly used method in data analysis. It is a complex problem that can be very time consuming, especially when clustering large datasets with many features. Most clustering algorithms scale exponentially in time when increasing the dataset size, making it infeasible to use them for large datasets. Streaming algorithms do not have this problem as they will scale linearly. Evolving Local Means (ELM) is a streaming clustering algorithm based on Mean Shift. The ELM algorithm iterates over all data samples in a single iteration, updating an internal structure of clusters with each sample being added. Similar to Mean Shift, ELM requires only one parameter to be provided by the user: the initial radius of a cluster. In ELM, each sample can either be added to the nearest cluster or turned into a new cluster, depending on the distance to the nearest cluster. Clusters that come too close to each other will be merged to form a single cluster. There is one issue with this approach: the decisions are based on an intermediate state that is constantly evolving. This might cause incorrect clustering results when ELM merges clusters together prematurely. This research introduces ELM Decision Branching (ELM-DB) which extends ELM with decision branching. ELM-DB aims to reduce the chance of a premature merge by postponing important decisions. Instead of making a decision directly, it will branch and continue all possible options until the decision can be made. ELM-DB is able to improve clustering performance with only a small increase in runtime and produce consistent results for a larger range of radius settings than ELM.