Benchmarking Checkpoint-based Fault Tolerance Algorithms in Stateful Stream Processing

More Info
expand_more

Abstract

Major advances in the fault tolerance of distributed stream processing systems provided the systems with the capacity to produce strictly consistent results under failures. Consistent fault tolerance has been one of the catalysts fueling the maturity of streaming systems and boosting their widespread adoption not only for analytics use cases, but also as a platform for running applications, such as microservices and stateful functions. The most common fault tolerance strategy that many modern streaming systems adopt is an adaptation of Chandy-Lamport's distributed snapshots protocol that is based on global coordinated checkpoints. Given the importance of fault tolerance and the impact of coordinated checkpoints in the stream processing space, it is surprising that no other fault tolerance algorithm has been considered as an alternative. In this paper we bring together and benchmark three seminal fault tolerance algorithms from the distributed systems literature: the coordinated, uncoordinated, and communication-induced checkpoint algorithms. We evaluate their behavior in terms of runtime performance and failure recovery on a variety of streaming workloads including pipeline, scatter-gather, and cyclical topologies. We benchmark the algorithms on a novel streaming system, FERDiS, built from scratch as an extensible benchmarking framework. The experiment results show that the coordinated approach that state of the art streaming systems adopt is in many cases optimal, but with some exceptions where it is outperformed in both runtime performance and failure recovery by the uncoordinated and communication-induced algorithms. The cause of the differences in performance, both at runtime and during recovery, was found to be strongly related to the characteristics of the input stream(s) and the query graph. On the tested cyclic topology, the coordinated checkpointing algorithm could not be applied as by its design it will deadlock on such query graphs. The natural cycle support of the uncoordinated and communication -induced proved advantageous here and we observed high susceptibility to the unbounded domino effect of the uncoordinated algorithm which the communication-induced was not affected by, indicating the unsuitability of uncoordinated checkpointing in cyclic stream processing.