Runtime interval optimization and dependable performance for application-level checkpointing
Apostolos Kokolis (National Technical University of Athens)
Alexandros Mavrogiannis (CMU)
Dimitrios Rodopoulos (National Technical University of Athens, Katholieke Universiteit Leuven)
Christos Strydis (Erasmus MC)
Dimitrios Soudris (National Technical University of Athens)
More Info
expand_more
Abstract
As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.