Benchmarking checkpointing algorithms in Stream Processing Engines

More Info
expand_more

Abstract

The use of data streams has increased a lot over the last two decades or so. and
With this increase comes the need for fast and consistent fault recovery. Rollback
recovery mechanisms from traditional distributed systems have been adapted successfully for stream engines. These mechanisms can be categorized into one of three different categories; uncoordinated, coordinated and communication induced protocols. While most well-known stream engines implement a variant of the coordinated Chandy-Lamport algorithm, there is no practical comparison available that actually confirms whether this is the optimal solution for data streams specifically. Compared to traditional distributed processing solutions, stream processing has a higher need for low latencies due to the continuously generated input. This paper aims to create more insight into the advantages and disadvantages of these solutions by implementing a checkpointing algorithm for each of these categories. These are then benchmarked using various workloads and evaluated using a number of metrics such as latency, throughput, recovery times and network overhead. From these results, it can be concluded that a coordinated approach indeed outperforms uncoordinated solutions across all of these metrics, most likely due to the need for message logging in both the uncoordinated and communication induced scenarios. Additionally the benchmarks indicate that the overhead of the communication induced approach does not outweigh its benefits, due to the rarity of the occurrence of the so-called domino effect.