Performance-Oriented Fault Tolerance in Computing Systems

Abstract

This dissertation addresses the overhead of fault tolerance (FT) techniques. Technology trends such as shrinking feature sizes and falling voltage levels make FT increasingly important in modern computing systems. FT techniques rely on some form of redundancy: space redundancy (additional hardware), time redundancy (multiple executions), and/or information redundancy (additional verification information). This redundancy significantly increases system cost and/or degrades performance, which is unacceptable in many cases. This dissertation proposes several methods to reduce that overhead. Most of them target the overhead of time redundancy, although some can also reduce the overhead of other forms of redundancy. Many time-redundant FT techniques execute instructions multiple times: redundant instruction copies are created in either hardware or software, and their results are compared to detect possible faults. This dissertation conjectures that different instructions need different levels of protection for reliable application execution. It investigates ways to assign appropriate protection levels to instructions, including the novel concept of an instruction vulnerability factor. Protecting critical instructions more strongly than others can yield significant performance improvements, power savings, and/or system cost reductions. In addition, it proposes employing instruction reuse techniques such as precomputation and memoization to reduce the number of instructions that must be re-executed for fault detection. Multicore systems have recently gained significant attention, both because of the widespread belief that instruction-level parallelism has reached its limits and because of power density constraints. In a cache-coherent multicore system, correct operation of the cache coherence protocol is essential to the system. This dissertation therefore also proposes a cache coherence verification technique that detects faults at a lower cost than previously proposed methods.
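
As a rough illustration of the selective, time-redundant checking summarized above, the following Python sketch re-executes only operations marked as critical and compares the two results. It is not the dissertation's implementation: the names `run_protected`, `FaultDetected`, the `critical` flag, and the `memo` table are hypothetical, and the sketch assumes that operations are deterministic and free of side effects, so that an earlier verified result can stand in for the redundant execution.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key type: (operation name, operand tuple).
MemoTable = Dict[Tuple[str, Tuple[int, ...]], int]


class FaultDetected(Exception):
    """Raised when a result disagrees with its redundant check value."""


def run_protected(
    op: Callable[..., int],
    args: Tuple[int, ...],
    critical: bool,
    memo: MemoTable,
) -> int:
    """Execute op(*args), adding a redundant check only if it is critical.

    Assumption: `critical` stands in for a per-instruction protection
    level; how that level is derived is outside this sketch.
    """
    result = op(*args)
    if not critical:
        # Low-priority operation: no redundant copy, so no extra overhead.
        return result

    key = (op.__name__, args)
    if key in memo:
        # Reuse: compare against an earlier verified result
        # instead of executing the operation a second time.
        check = memo[key]
    else:
        # Time redundancy: execute a second copy for comparison.
        check = op(*args)

    if result != check:
        raise FaultDetected(f"mismatch in {op.__name__}{args}")

    memo[key] = result  # remember the verified result for later reuse
    return result
```

In this sketch a critical operation costs two executions the first time it is seen with given operands and one execution afterwards, while a non-critical operation always costs one. Detection on the reuse path relies on the stored value being correct, which is one reason the real design space is richer than this toy model.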
