Experimental evaluation of distributed similarity joins in stream processing environments

Abstract

Similarity joins identify pairs of similar records within one or more datasets. These operations are typically time-sensitive, since identifying relations promptly can increase profitability, so it is advantageous to run them on a stream processing system, which offers real-time capabilities. However, because they require comparing large numbers of records, similarity joins can be resource-intensive.

To address this challenge, the operations can be executed in a distributed setting, which proves to be the most effective approach for resource management. In this research, we evaluate four distinct distributed systems designed for similarity joins in stream processing environments. The primary objective is to assess their individual strengths and weaknesses, as well as their overall efficiency. Our investigation shows that certain solutions scale better and use resources more efficiently than others, and it highlights the potential for further advances in this domain.
