Estimation of Similarity Between Data Streams Using Probabilistic Data Structures

Abstract

This thesis addresses the problem of efficiently computing similarities between data streams in real time, a task of growing importance with the rise of big data and real-time analytics. Its central contribution is an extension of the Condor framework with new probabilistic data structures suited to the constraints of streaming data. In particular, the DSTree data structure is adapted to a streaming environment and implemented within Condor, and this implementation is used to explore the core mechanisms for indexing streams and approximating their similarity. In addition, a comparative study examines several probabilistic data structures, including HyperLogLog and Theta Sketches, and compares their effectiveness for similarity search in a streaming environment with that of the DSTree method. The methods are evaluated through a series of experiments designed to measure the accuracy and efficiency of each structure and to expose their strengths and limitations. The insights gained from this study underscore the potential of probabilistic data structures to improve the speed and accuracy of similarity search over streaming data, and point to promising directions for further research.
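To illustrate how a sketch can approximate the similarity between two streams in bounded memory, the following is a minimal Python sketch of a k-minimum-values (KMV) estimator, a technique closely related to Theta Sketches. It is not the Condor implementation described in the thesis. The names `KMVSketch`, `hash01`, and `estimate_jaccard`, and the parameter `k`, are assumptions introduced only for this example.

```python
import hashlib


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps only the k smallest distinct hash values seen in a stream."""

    def __init__(self, k: int = 1024):
        self.k = k
        self.values = set()

    def update(self, item: str) -> None:
        v = hash01(item)
        if v in self.values:
            return  # duplicate item, sketch is unchanged
        if len(self.values) < self.k:
            self.values.add(v)
            return
        largest = max(self.values)  # O(k) scan, kept simple for clarity
        if v < largest:
            self.values.remove(largest)
            self.values.add(v)


def estimate_jaccard(a: KMVSketch, b: KMVSketch) -> float:
    """Estimate |A ∩ B| / |A ∪ B| from two sketches.

    The k smallest hashes of the merged sketches form a uniform sample of
    the union; the fraction of that sample present in both sketches
    estimates the Jaccard similarity.
    """
    k = min(a.k, b.k)
    union_sample = sorted(a.values | b.values)[:k]
    if not union_sample:
        return 0.0
    in_both = sum(1 for v in union_sample if v in a.values and v in b.values)
    return in_both / len(union_sample)


if __name__ == "__main__":
    stream_a, stream_b = KMVSketch(), KMVSketch()
    for i in range(10_000):
        stream_a.update(f"item-{i}")
    for i in range(5_000, 15_000):
        stream_b.update(f"item-{i}")
    # True Jaccard similarity of the two item sets is 5000 / 15000.
    print(estimate_jaccard(stream_a, stream_b))
```

Each sketch stores at most `k` values regardless of stream length, and the estimation error shrinks roughly with the square root of `k`. When a stream holds fewer than `k` distinct items, the sketch retains every hash and the estimate is exact, barring hash collisions.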