Runtime interval optimization and dependable performance for application-level checkpointing

Conference Paper (2016)
Author(s)

Apostolos Kokolis (National Technical University of Athens)

Alexandros Mavrogiannis (CMU)

Dimitrios Rodopoulos (National Technical University of Athens, Katholieke Universiteit Leuven)

Christos Strydis (Erasmus MC)

Dimitrios Soudris (National Technical University of Athens)

Affiliation
External organisation
DOI related publication
https://doi.org/10.3850/9783981537079_0294 Final published version
Publication Year
2016
Language
English
Article number
7459381
Pages (from-to)
594-599
Publisher
IEEE
ISBN (electronic)
9783981537062
Event
19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016 (2016-03-14 - 2016-03-18), Dresden, Germany

Abstract

As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general-purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
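
The abstract does not say how DepMan derives its checkpoint interval, so the following is only an illustrative sketch of what runtime interval optimization can look like. It uses Young's classic first-order approximation, T = sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint and MTBF is the mean time between failures. The names `young_interval` and `IntervalTuner`, the exponential smoothing of the MTBF estimate, and the `alpha` parameter are all assumptions made for this example and are not taken from the paper.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Illustrative only: the paper's actual optimization is not described
    in the abstract.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


class IntervalTuner:
    """Hypothetical helper that follows a time-varying failure rate.

    It keeps an exponentially smoothed MTBF estimate and recomputes the
    checkpoint interval each time a failure is observed.
    """

    def __init__(self, checkpoint_cost_s: float, initial_mtbf_s: float,
                 alpha: float = 0.5) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.checkpoint_cost_s = checkpoint_cost_s
        self.mtbf_s = initial_mtbf_s
        self.alpha = alpha  # weight given to the newest observation

    def interval(self) -> float:
        """Current checkpoint interval, in seconds."""
        return young_interval(self.checkpoint_cost_s, self.mtbf_s)

    def record_failure(self, time_since_last_failure_s: float) -> float:
        """Blend the observed inter-failure time into the MTBF estimate
        and return the updated checkpoint interval."""
        self.mtbf_s = (self.alpha * time_since_last_failure_s
                       + (1.0 - self.alpha) * self.mtbf_s)
        return self.interval()
```

In this sketch, a rising failure rate (shorter gaps between failures) lowers the MTBF estimate and therefore shortens the interval, while a falling rate lengthens it. The approximation is generally considered reasonable only when the checkpoint cost is small relative to the MTBF.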