Estimation of Similarity Between Data Streams Using Probabilistic Data Structures

Master Thesis (2024)

Author(s)

P. Reppas (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Asterios Katsifodimos – Mentor (TU Delft - Data-Intensive Systems)

G. Siachamis – Coach (TU Delft - Web Information Systems)

Faculty

Electrical Engineering, Mathematics and Computer Science

Data streams, Probabilistic data structures

To reference this document use:

https://resolver.tudelft.nl/uuid:0d1994ec-4d32-4dbb-a0e9-27fdfab58780

Publication Year

2024

Language

English

Graduation Date

26-02-2024

Awarding Institution

Delft University of Technology

Project

Similarity estimation

Programme

Computer Science | Software Technology

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis addresses the problem of efficiently computing similarities between data streams in real time, a task of growing importance with the rise of big data and real-time analytics. Its central contribution is an extension of the Condor framework with new probabilistic data structures tailored to the challenges of streaming data. A key part of this work is the adaptation of the DSTree data structure to a streaming environment. Through an implementation within the Condor framework, the research explores the core mechanisms for indexing and approximating similarities. In addition, a comparative study covers several probabilistic data structures, including HyperLogLog and Theta Sketches, and examines their effectiveness for similarity search in a streaming environment relative to the DSTree method. These methods are evaluated through a series of experiments designed to measure the accuracy and efficiency of the structures and to expose their potential and limitations. The insights from this study underscore the potential of probabilistic data structures to improve the speed and accuracy of similarity search over streaming data, and they point to promising directions for further research.

Files

Estimation_of_Similarity_Betwe... (pdf)

(pdf | 6.73 Mb)

License info not available