Adding fault tolerance to OpenCL
Through redundant heterogeneous computing
R.A. Bijl (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Z. Al-Ars – Mentor (TU Delft - Computer Engineering)
C. Lofi – Graduation committee member (TU Delft - Web Information Systems)
Pekka Jääskeläinen – Coach (Tampere University)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
The ever-increasing demand for computing has led to the need for specialized heterogeneous hardware and for frameworks to utilize it. Besides traditional central processing units, more and more programs will use specialized hardware to accelerate computations. However, the increase in computing also leads to a shorter mean time between failures. In this thesis, we apply fault tolerance to the Portable Computing Language (PoCL), an open-source implementation of the OpenCL standard. We show that our solution is easy to apply to existing programs that use PoCL/OpenCL and greatly reduces the total number of errors visible to the end user. Our solution can be used on any device supported by PoCL and incurs low overhead, provided that the hardware requirements are met.