Reservoir Simulation of Foam Flow using a Kepler GPU


Abstract

In recent years, driven by the higher computational speed and memory bandwidth of GPUs compared to CPUs, GPU-accelerated reservoir simulation has been studied extensively. Results so far have shown that Fermi-generation GPUs can accelerate IMPES (implicit pressure, explicit saturation) reservoir simulation considerably. The current Kepler generation improves on its predecessor with much higher FLOPS and memory bandwidth, along with several new features. However, major changes in Kepler, such as the removal of automatic L1 caching of global memory and the reliance on instruction-level parallelism, make additional optimizations essential to reach close-to-peak performance. On the application side, researchers have found that foam can improve gas injection by addressing several causes of poor gas sweep efficiency. Foam simulation, however, is hampered by long simulation times because of the large slope of its fractional-flow function. This paper discusses how to implement efficient IMPES reservoir simulation on current-generation Kepler GPUs and how to apply it to foam simulation. The IMPES code is optimized by maximizing exposed parallelism (both thread-level and instruction-level parallelism, TLP and ILP), coalescing global memory accesses, reducing redundant global memory accesses by explicitly using shared memory via warp specialization while avoiding memory bank conflicts, using 1D texture memory as a precomputed table, and exploiting the various forms of GPU read-only memory. Furthermore, since reservoir simulation components such as sparse matrix formats, preconditioners, and solvers that work well on CPUs may be inefficient on GPUs, the components are chosen so that they are efficient not only for foam simulation but also for execution on GPUs. For the example considered in this report, using a GTX Titan Black GPU, speed-ups of up to 129 times are obtained in the saturation update and matrix assembly compared to a parallel implementation on a quad-core Intel Core i7-4770K. For the pressure solver, the GPU implementation is up to 39 times faster than the CPU implementation. The maximum solver speedup is achieved for large models (more than six million grid cells); for smaller models (a million cells or less) the speedup is reduced, because GPU-CPU data transfer latency can still dominate and small data sets fit in the CPU cache. Overall, the use of a GPU makes it possible to complete large-scale foam simulations (six million grid cells or more) in days, whereas the same simulation is predicted to take months on a CPU.
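
The abstract mentions maximizing both thread-level and instruction-level parallelism. As a minimal illustration, not code from the report itself, the sketch below shows the common pattern of letting each thread process several independent elements per loop iteration, so that the compiler can interleave independent loads and arithmetic to keep Kepler's SMX pipelines busy; the kernel, its name, and the ILP factor of 4 are illustrative assumptions.

```cuda
// Hypothetical sketch: combining TLP (many threads) with ILP (several
// independent operations per thread), which Kepler relies on for peak
// throughput. The const __restrict__ pointers also let the compiler route
// loads through Kepler's read-only data cache, one of the forms of
// read-only memory the abstract refers to.
#include <cuda_runtime.h>

constexpr int ILP = 4;  // independent elements per thread (tuning parameter)

__global__ void axpy_ilp(int n, float a,
                         const float* __restrict__ x,
                         const float* __restrict__ y,
                         float* __restrict__ out)
{
    int stride = blockDim.x * gridDim.x;
    // Grid-stride loop advancing ILP strides at a time; the unrolled inner
    // loop issues ILP independent load/FMA/store chains, hiding latency
    // without needing more resident threads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += ILP * stride) {
#pragma unroll
        for (int k = 0; k < ILP; ++k) {
            int j = i + k * stride;
            if (j < n)
                out[j] = a * x[j] + y[j];
        }
    }
}
```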
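The abstract also mentions staging data in shared memory while avoiding bank conflicts, which matters on Kepler because it dropped Fermi's automatic L1 caching of global loads. The report's warp-specialized assembly kernels are not reproduced in the abstract; the following is a generic sketch of the padding idiom using a tiled transpose, assuming a launch configuration of dim3(32, 32) threads per block.

```cuda
constexpr int TILE = 32;

// Each row of the shared tile is padded by one element so that column-wise
// accesses map to different shared memory banks, avoiding 32-way conflicts.
__global__ void transpose_tiled(const float* __restrict__ in, float* out,
                                int width, int height)  // in: height x width
{
    __shared__ float tile[TILE][TILE + 1];  // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced load

    __syncthreads();

    // Swap block indices so the store is coalesced as well.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free
}
```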
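Finally, the abstract refers to using 1D texture memory as a precomputed table. A plausible reading is a tabulated function of saturation (for example, fractional flow) fetched through the texture cache with hardware linear interpolation. The sketch below, including the function names and the assumption that the lookup coordinate lies in [0, 1], is illustrative rather than the report's actual code.

```cuda
#include <cuda_runtime.h>

// Build a 1D texture object over a precomputed table of n float values.
cudaTextureObject_t makeTableTexture(const float* h_table, int n)
{
    cudaArray_t arr;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&arr, &desc, n);
    cudaMemcpy2DToArray(arr, 0, 0, h_table, n * sizeof(float),
                        n * sizeof(float), 1, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModeLinear;   // interpolate between entries
    tex.readMode = cudaReadModeElementType;
    tex.normalizedCoords = 1;                // coordinates in [0, 1]

    cudaTextureObject_t texObj;
    cudaCreateTextureObject(&texObj, &res, &tex, nullptr);
    return texObj;
}

// Evaluate the tabulated function at saturations s[i], assumed in [0, 1].
__global__ void lookupTable(cudaTextureObject_t table, const float* s,
                            float* f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        f[i] = tex1D<float>(table, s[i]);  // cached, hardware-interpolated fetch
}
```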