Adding fault tolerance to OpenCL

Through redundant heterogeneous computing

More Info
expand_more

Abstract

The ever-increasing demand for computing has led to the need for specialized heterogeneous hardware, and the frameworks required to utilize them. Besides the traditional central processing units, more and more programs will make use of specialized hardware to accelerate computations. However, the increase in computing also leads to shorter mean time between failures. In this thesis, we apply fault tolerance to Portable Computing Language (PoCL), an open-source implementation of the OpenCL standard. We show that our solution is easy to apply to existing programs making use of PoCL/OpenCL and is able to greatly reduce the total number of errors visible to the end user. Our solution can be used on any device supported by PoCL and provides a low overhead, given that the hardware requirements are met.