Experimental evaluation of distributed similarity joins in stream processing environments

Master thesis (2023)

Authors

T. Hernandez Quintanilla (Electrical Engineering, Mathematics and Computer Science)

Contributors

A. Katsifodimos (Web Information Systems, supervisor 1)

G. Siachamis (Web Information Systems, supervisor 2)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:0b876868-c6d5-4a82-86a8-dc581c8b12a6

Published Date

26-09-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

A similarity join identifies pairs of similar records within a single dataset or across multiple datasets. These operations are typically time-sensitive, since identifying relations promptly can increase profitability. It is therefore advantageous to execute them on a stream processing system, which offers real-time capabilities. Because many records must be compared with one another, similarity joins can be resource-intensive.

To address this challenge, executing the operations in a distributed setting proves to be the most effective approach to resource management. In this research, we evaluate four distributed systems designed for similarity joins in stream processing environments. The primary objective is to assess their individual strengths and weaknesses as well as their overall efficiency. Our investigation reveals that certain solutions scale better and use resources more efficiently than others, and it highlights the potential for further advances in this domain.
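
As an illustration of the operation the abstract describes, the following is a minimal sketch of a naive similarity self-join. It is not the thesis's method and does not correspond to any of the four evaluated systems. The choice of Jaccard similarity over token sets, the function names `jaccard` and `similarity_self_join`, the sample records, and the threshold of 0.5 are all assumptions made for this example. The sketch compares every pair of records, which shows why the cost grows quickly with the number of records and why distributing the work can be attractive.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union.

    Defined here as 0.0 when both sets are empty (an assumption for this sketch).
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_self_join(records: dict, threshold: float) -> list:
    """Return (id1, id2, similarity) for every pair at or above the threshold.

    Naive all-pairs comparison: n records lead to n * (n - 1) / 2 comparisons.
    """
    result = []
    # Sort by record id so the output order is deterministic.
    for (id1, s1), (id2, s2) in combinations(sorted(records.items()), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            result.append((id1, id2, sim))
    return result


# Hypothetical records, each represented as a set of tokens.
records = {
    "r1": {"red", "apple", "fresh"},
    "r2": {"red", "apple", "dried"},
    "r3": {"blue", "car"},
}

pairs = similarity_self_join(records, threshold=0.5)
print(pairs)
```

In a streaming setting the records arrive continuously rather than as a fixed dictionary, so a real system would also have to decide which earlier records each new record is compared against; the sketch above leaves that out.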

Files

Thesis.pdf

(.pdf | 1.51 Mb)