G. Siachamis
10 records found
This paper evaluates the state-of-the-art control-based solutions in the autoscaling area with diverse, dynamic workloads, applying specific metrics. We investigate different aspects of the autoscaling problem, such as performance and convergence. Our experiments reveal that current control-based autoscaling techniques fail to account for the lag cost generated by rescaling or underprovisioning and cannot efficiently handle practical scenarios of intensely dynamic workloads. Unexpectedly, we discovered that an autoscaling method not tailored for streaming can outperform others in certain scenarios.
In Chapter 2, we study adaptivity to statistical changes through the important task of streaming similarity joins, which is heavily affected by imbalanced loads, a by-product of statistical changes. We propose S3J, the first adaptive distributed streaming similarity join method in the general metric space. It employs a two-layered adaptive partitioning scheme to reduce unnecessary similarity computations and distribute the load across the available workers. Our partitioning scheme is paired with an efficient load balancing scheme that leverages the existing partitioning to rebalance any imbalanced load. Our results show that S3J outperforms the employed baseline, inspired by a MapReduce method, in terms of partitioning efficiency. Additionally, our experiments show that the load balancing scheme can gradually defuse the imbalanced load and involve all the available workers in the processing.
The majority of stream processing engines employ a checkpoint-based fault tolerance mechanism. In Chapter 3, we look at adaptivity to infrastructure failures through the existing checkpointing protocols. We propose CheckMate, a principled experimental framework for evaluating checkpointing protocols for streaming dataflows. First, we summarize all the essential preliminaries required to study checkpoint-based fault tolerance. Then, we discuss in detail, implement, and evaluate the three main checkpointing protocols in different scenarios. Our evaluation shows that when the load is uniformly distributed, the coordinated checkpointing protocol, which most stream processing engines implement, outperforms the alternatives. However, the uncoordinated protocol prevails in the presence of skew and shows no domino effect when cyclic queries are employed.
Finally, in Chapter 4, we address the problem of adaptivity to input rate changes. Although multiple solutions have been proposed, their experimental evaluation is shallow and does not include detailed comparisons with other solutions. We propose a principled evaluation framework for stream processing autoscalers. We establish important metrics, queries, and workloads in order to provide guidelines for the evaluation of autoscaling solutions for stream processing. We discuss the state-of-the-art control-based autoscalers, and we evaluate them using the proposed framework. Our results show that, for complex queries, none of the evaluated autoscalers can adapt efficiently, while for simple stateless queries, a simple generic autoscaler outperforms the solutions tailored to stream processing.
We conclude this thesis by summarizing our main findings and discussing the limitations of our work. Based on the valuable insights we gained while designing and implementing the research work included in this thesis, we propose a series of interesting and important future research directions that are not limited to adaptivity problems but address stream processing in general.
In this work, we evaluate autoscaling solutions for stream processing engines. Although autoscaling has become a mainstream subject of research in the last decade, the database research community has yet to evaluate different autoscaling techniques under a proper benchmarking setting and evaluation framework. As a result, every newly proposed autoscaling solution performs only a shallow performance evaluation and comparison against existing solutions. In this paper, we evaluate autoscaling solutions by employing two streaming queries and a dynamic workload that follows a cosine pattern. Our experiments reveal that current autoscaling techniques fail to account for the lag generated by rescaling or underprovisioning and cannot efficiently handle practical scenarios of intensely dynamic workloads.
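As a rough illustration of what a workload following a cosine pattern could look like, the sketch below computes a target input rate over time. It is not taken from the paper: the function name `cosine_input_rate` and the parameters `period`, `mean_rate`, and `amplitude`, along with their default values, are assumptions chosen only for the example.

```python
import math


def cosine_input_rate(t, period=3600.0, mean_rate=1000.0, amplitude=500.0):
    """Target input rate (events per second) at time t, given in seconds.

    Hypothetical parameters, not values from the paper:
    period    -- length of one full cosine cycle in seconds
    mean_rate -- rate around which the workload oscillates
    amplitude -- maximum deviation from mean_rate
    """
    return mean_rate + amplitude * math.cos(2.0 * math.pi * t / period)


# Sample the rate every 15 minutes over one assumed one-hour cycle.
sample_times = list(range(0, 3601, 900))
rates = [cosine_input_rate(t) for t in sample_times]

for t, rate in zip(sample_times, rates):
    print(f"t={t:>4}s  rate={rate:8.1f} events/s")
```

Under these assumed parameters the rate starts at its peak, falls to its minimum half a period later, and returns to the peak at the end of the cycle, so an autoscaler driven by such a generator would have to scale both down and up within each cycle.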
The goal of this doctoral work is to design and provide scalable methods to support data integration tasks on massive data streams, i.e., to support streaming data integration. The aim of this work is threefold. First, we aim to develop streaming methods to compute temporal stream data profiles and summaries that can describe the dynamic state of a stream over time. Second, we aim to develop methods and metrics of stream similarity. Those methods and metrics can serve as a means to detect similar or complementary streams in a streaming data lake. Finally, we aim to optimize distributed streaming similarity joins, a very important operation that precedes entity linking and resolution. This paper discusses exciting challenges and open problems in the field, and a research plan for tackling them.
Valentine in Action
Matching Tabular Data at Scale