Implementing Finite Difference Schemes on Graphics Processing Units


Abstract

The continued development of improved algorithms and architectures for numerical simulation is at the core of increased computational performance and, therefore, of the ability to perform more complex and precise numerical simulations in less time in areas such as Computational Fluid Dynamics. Employing faster algorithms on more efficient processing units, such as Graphics Processing Units (GPUs), can reduce not only the time spent per simulation but also the energy required to perform these computations. This benefits different areas of research and engineering and, through the improvements achieved in these, society at large.

As simulations grow in dimensionality and accuracy, the wall-clock time is bound to increase substantially, reaching days and potentially months depending on the parameters and geometry of the chosen simulation. The performance of finite difference solvers with different degrees of optimization on different types of compute hardware was investigated and the achieved speedups assessed. Specifically, a serial CPU-based solver was presented as a baseline and then transitioned to a GPU-based solver, which in turn was optimized further by reducing redundant memory accesses. To make comparisons fairer, all of the solvers used the same temporal and spatial discretization techniques. Further, a benchmarking scenario was proposed for the different solvers across the hardware used, including the relevant geometry, grid, and initial and boundary conditions.

The speedups between the different solvers were observed and contextualized with regard to the effort that went into implementing the solvers and the capability and cost of the hardware used. The speedups at different problem sizes were investigated with the aim of establishing how the performance gain from parallelizing and optimizing solvers scales with the chosen number of grid points and therefore with the computational load.

Very significant speedups were achieved between the regular CPU solver and its GPU implementation, clearly showing the possible performance gains when moving from a serial implementation to a parallel one running on an accelerator. The speedups between the two GPU-based solvers were more modest but still significant when considering the possible time spent on one simulation, depending on the chosen number of grid points.

Finally, an additional performance analysis was performed on two Navier-Stokes solvers, one of which was the optimized version of the other, to investigate whether the performance increases were in line with prior findings and what magnitude of reduction in wall-clock time was possible for a state-of-the-art finite difference Navier-Stokes solver.