Integrating Massive Data Streams
Abstract
Data integration has been a long-standing and challenging problem for enterprises and researchers. Data residing in multiple heterogeneous sources must be integrated and prepared so that the valuable information it carries can be extracted and analysed. However, the volume and velocity of the data produced, together with modern business needs for real-time results, have pushed data analytics, and therefore data integration, towards data streams. While data integration is a hard problem in and of itself, integrating data streams is even more challenging: streams are characterized by their high velocity, infinite nature, and predisposition to concept drift.
The goal of this doctoral work is to design and provide scalable methods that support data integration tasks on massive data streams, i.e., streaming data integration. The aim of this work is threefold. First, we aim to develop streaming methods for computing temporal stream data profiles and summaries that describe the dynamic state of a stream over time. Second, we aim to develop methods and metrics of stream similarity. These methods and metrics can serve as a means of detecting similar or complementary streams in a streaming data lake. Finally, we aim to optimize distributed streaming similarity joins, a very important operation that precedes entity linking and resolution. This paper discusses exciting challenges and open problems in the field, and a research plan for tackling them.
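To make the first aim concrete, a temporal stream profile could be sketched as below. This is a minimal illustration under stated assumptions, not the method proposed in this work: the names `RunningProfile` and `windowed_profiles` are hypothetical, events are assumed to be `(timestamp, value)` pairs arriving in timestamp order, and the profile is limited to count, mean, and population variance, maintained in a single pass with Welford's update and reset at each tumbling-window boundary.

```python
from dataclasses import dataclass


@dataclass
class RunningProfile:
    """Hypothetical single-pass numeric profile using Welford's update."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # Constant memory per profile: no values are buffered.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty profile.
        return self.m2 / self.count if self.count else 0.0


def windowed_profiles(events, window_size):
    """Yield (window_start, profile) for each non-empty tumbling window.

    Assumes `events` is an iterable of (timestamp, value) pairs
    in non-decreasing timestamp order.
    """
    current_start = None
    profile = RunningProfile()
    for ts, value in events:
        start = (ts // window_size) * window_size
        if current_start is None:
            current_start = start
        elif start != current_start:
            # The window has closed: emit its profile and start a new one.
            yield current_start, profile
            current_start, profile = start, RunningProfile()
        profile.update(value)
    if profile.count:
        yield current_start, profile


if __name__ == "__main__":
    stream = [(0, 1.0), (1, 3.0), (10, 10.0), (12, 20.0), (13, 30.0)]
    for start, p in windowed_profiles(stream, window_size=10):
        print(start, p.count, p.mean, p.variance)
```

Comparing consecutive window profiles (for example, a shift in the mean between windows) is one simple way such summaries could expose concept drift. Out-of-order arrivals and non-numeric attributes are deliberately left out of this sketch.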