Adaptivity for Streaming Dataflow Engines

More Info
expand_more

Abstract

Data processing has heavily evolved in the last two decades, from single-node processing to distributed processing and from the MapReduce paradigm to the stream processing paradigm. At the same time, cloud computing has emerged as the primary means of deploying and operating a data processing system. In the cloud era, flexible resource allocation combined with flexible pricing schemes have brought forward new opportunities and have democratized access to computing resources. However, streaming dataflow or stream processing engines were originally designed for in-house clusters of fixed resources with limited needs for adaptivity. Therefore, they lack the mechanisms to adapt to unexpected changes in the needs of the processing workload. When solutions have been proposed in the literature, their experimental evaluation is limited hindering the progress of the field. The same applies to the native fault tolerance mechanisms that virtually every stream processing engine employs. In this thesis, we study the problem of adaptivity for streaming dataflow engines, and we focus on three major adaptivity subproblems: adaptivity to 𝑖) statistical changes, 𝑖𝑖) infrastructure failures, and 𝑖𝑖𝑖) input rate changes.
In Chapter 2, we study adaptivity to statistical changes through the important task of streaming similarity joins that is heavily affected by imbalanced loads, a by-product of statistical changes. We propose S3J ; the first adaptive distributed streaming similarity joins method in the general metric space that employs a two-layered adaptive partitioning scheme to reduce unnecessary similarity computations and distribute the load to the available workers. Our partitioning scheme is paired with an efficient load balancing scheme that leverages the existing partitioning in order to rebalance any imbalanced load. Our results show that S3J outperforms the employed baseline, inspired by a MapReduce method, in terms of partitioning efficiency. Additionally, our experiments show that the load balancing scheme can gradually defuse the imbalanced load and involve all the available workers in the processing.
The majority of the stream processing engines employ a checkpoint-based fault tolerance mechanism. In Chapter 3, we look at the adaptivity to infrastructure failures through the existing checkpointing protocols. We propose CheckMate, a principled experimental framework for evaluating checkpointing protocols for streaming dataflows. First, we summarize all the essential preliminaries required to study checkpoint-based fault tolerance. Then, we discuss in detail, implement, and evaluate in different scenarios the three main checkpointing protocols. Our evaluation shows that when the load is uniformly distributed, the implemented by most stream processing engines coordinated checkpointing protocol outperforms the alternatives. However, the uncoordinated prevails in the presence of skew, while it shows no domino effect when cyclic queries are employed.
Finally, in Chapter 4, we address the problem of adaptivity to input rate changes. Although multiple solutions have been proposed, their experimental evaluation is shallow and does not include detailed comparisons with other solutions. We propose a principled evaluation framework for stream processing autoscalers. We establish important metrics, queries, and workloads in order to provide guidelines for the evaluation of autoscaling solutions for stream processing. We discuss the state-of-the-art control-based autoscalers, and we evaluate them using the proposed framework. Our results show that, for complex queries, none of the evaluated autoscalers can adapt efficiently, while for simple stateless queries, a simple generic autoscaler outperforms the solutions tailored to stream processing.
We conclude this thesis by summarizing our main findings and discussing the limitations of our work. Based on the valuable insights we gained while designing and implementing the research work included in this thesis, we propose a series of interesting and important future research directions that are not limited to adaptivity problems but address stream processing in general.