Efficient Matrix Operations and Parallel Programming for the Conjugate Gradient method on the DelftBlue
P.O.H. de la Court (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Martin van Gijzen – Mentor (TU Delft - Numerical Analysis)
D.J.P. Lahaye – Graduation committee member (TU Delft - Mathematical Physics)
Abstract
This research investigates efficient matrix-vector multiplication and storage techniques for a Fortran implementation of the Conjugate Gradient (CG) method on GPUs. The CG method is a widely used iterative algorithm for solving large, sparse linear systems.
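To fix notation, the CG iteration itself is short. The routine below is a minimal Standard Fortran sketch (the name cg_solve and all variable names are ours, not the thesis code); it makes visible why the matrix-vector product q = A*p dominates the cost of each iteration.

    ! Minimal sketch of the CG iteration for A x = b, with A symmetric
    ! positive definite. Illustrative only, not the thesis implementation.
    subroutine cg_solve(A, b, x, tol, max_it)
       real,    intent(in)    :: A(:,:), b(:), tol
       real,    intent(inout) :: x(:)
       integer, intent(in)    :: max_it
       real    :: r(size(b)), p(size(b)), q(size(b))
       real    :: alpha, beta, rho, rho_new
       integer :: it
       r   = b - matmul(A, x)        ! initial residual
       p   = r
       rho = dot_product(r, r)
       do it = 1, max_it
          q     = matmul(A, p)       ! matrix-vector product: the dominant cost
          alpha = rho / dot_product(p, q)
          x     = x + alpha*p
          r     = r - alpha*q
          rho_new = dot_product(r, r)
          if (sqrt(rho_new) <= tol) exit
          beta = rho_new / rho       ! keeps search directions A-conjugate
          p    = r + beta*p
          rho  = rho_new
       end do
    end subroutine cg_solve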
Because its performance is heavily influenced by the efficiency of matrix operations, it is important to use parallel computing and run the computations on a GPU. Three types of matrices were considered for the coefficient matrix A: a dense symmetric positive-definite matrix and two sparse matrices, arising from the discretization of a 1D Laplace equation and a 2D Poisson equation, respectively. For the sparse matrices, the Compressed Diagonal Storage (CDS) format was implemented to reduce memory usage and computational cost. For parallel execution, three implementations were benchmarked: Standard Fortran, OpenACC, and CUDA Fortran. The results show that CDS significantly improves both memory efficiency and runtime performance for sparse matrices, and that CUDA outperforms OpenACC and Standard Fortran in speed, especially for large matrices. Standard Fortran remains competitive for small matrices because of its low overhead, but it scales poorly. The convergence behavior of the CG method was also found to depend strongly on the condition number of the matrix, with the 2D Poisson matrix converging much faster than the 1D Laplace matrix.
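To illustrate the storage scheme, a CDS matrix-vector product loops only over the stored diagonals. The sketch below is our own illustration, assuming the diagonals of A are stored column-wise in diags with their offsets in ioff; the do concurrent loop marks exactly the parallelism that an OpenACC directive or a CUDA kernel would exploit in the other two implementations.

    ! Sketch of y = A*x in Compressed Diagonal Storage. diags(i, d)
    ! holds A(i, i + ioff(d)), where ioff(d) is the offset of the d-th
    ! stored diagonal (0 = main, -1/+1 = sub-/superdiagonal).
    ! Names and layout are illustrative, not the thesis code.
    subroutine cds_matvec(diags, ioff, x, y)
       real,    intent(in)  :: diags(:,:), x(:)
       integer, intent(in)  :: ioff(:)
       real,    intent(out) :: y(:)
       integer :: i, d, n
       n = size(x)
       y = 0.0
       do d = 1, size(ioff)
          ! rows are independent, so each diagonal update parallelizes;
          ! the bounds keep the column index i + ioff(d) inside 1..n
          do concurrent (i = max(1, 1 - ioff(d)) : min(n, n - ioff(d)))
             y(i) = y(i) + diags(i, d) * x(i + ioff(d))
          end do
       end do
    end subroutine cds_matvec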
This study concludes that efficient storage and GPU-based matrix operations, particularly using CUDA, are critical for scalable performance in solving large linear systems with the CG method. Future work could explore combining the portability and ease of use of Standard Fortran with GPU acceleration to achieve both maintainability and high performance.
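That future-work direction is already supported by current compilers: NVIDIA's nvfortran, for example, can offload do concurrent loops such as the one in the CDS sketch above to the GPU with the -stdpar=gpu flag. For contrast, a hand-written CUDA Fortran version of the same product might look like the following kernel, a sketch under our naming with one thread per row, not the thesis implementation.

    ! Hedged CUDA Fortran sketch of the CDS product y = A*x;
    ! each thread accumulates one row across all stored diagonals.
    module cds_kernels
       use cudafor
    contains
       attributes(global) subroutine cds_matvec_kernel(n, ndiag, diags, ioff, x, y)
          integer, value  :: n, ndiag
          real,    device :: diags(n, ndiag), x(n), y(n)
          integer, device :: ioff(ndiag)
          integer :: i, d, j
          real    :: s
          i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
          if (i > n) return
          s = 0.0
          do d = 1, ndiag
             j = i + ioff(d)
             if (j >= 1 .and. j <= n) s = s + diags(i, d)*x(j)
          end do
          y(i) = s
       end subroutine cds_matvec_kernel
    end module cds_kernels

With the arrays resident on the device, a host would launch it as, for example, call cds_matvec_kernel<<<(n + 255)/256, 256>>>(n, ndiag, d_diags, d_ioff, d_x, d_y).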