CRAFT
A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
Faisal Shahzad (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Jonas Thies (Deutsches Zentrum für Luft- und Raumfahrt (DLR))
Moritz Kreutzer (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Thomas Zeiser (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Georg Hager (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Gerhard Wellein (Friedrich-Alexander-Universität Erlangen-Nürnberg)
Abstract
To use future generations of supercomputers efficiently, the High Performance Computing (HPC) community anticipates two prime challenges: fault tolerance and power consumption. Checkpoint/Restart (CR) has been, and still is, the most widely used technique for dealing with hard failures. Application-level CR is the most efficient CR technique in terms of overhead, but it requires considerable implementation effort.
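To illustrate what application-level CR involves, the following is a minimal sketch in Python. It is not CRAFT's API: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the toy summation loop are all assumptions made for illustration. The application itself decides which data make up its restartable state, writes them periodically, and reads them back on startup.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically: dump to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Toy iterative computation (sum of step indices) with periodic checkpoints.

    fail_at is a hypothetical knob that simulates a hard failure at a given step.
    """
    # Restart: resume from the last checkpoint if there is one.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hard failure")
        state["total"] += state["step"]  # the actual "work" of one iteration
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Even in this toy case, the application code has to select its state, choose a checkpoint frequency, and handle both the write and the restart path, which hints at the implementation effort the abstract refers to.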