Performance engineering for a tall & skinny matrix multiplication kernels on GPUs

Conference paper (2020)

Authors

Dominik Ernst Friedrich-Alexander-Universität Erlangen-Nürnberg

Georg Hager Friedrich-Alexander-Universität Erlangen-Nürnberg

J. Thies Deutsches Zentrum für Luft- und Raumfahrt e.V. (DLR)

Gerhard Wellein Friedrich-Alexander-Universität Erlangen-Nürnberg

Affiliation

External organisation

Matrix multiplication, GPU, Tall & skinny

To reference this document use:

http://resolver.tudelft.nl/uuid:1746d8c6-0e30-439c-b5b3-3bfa0cc64416

Published Date

2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches to parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables an implementation that is simultaneously flexible and specialized, with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.
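The roofline bound mentioned in the abstract can be illustrated with a short Python sketch. It is not the authors' model or code. It assumes the product has the form C = A^T B, with A of shape (K, M), B of shape (K, N), and K much larger than M and N. It also assumes double-precision elements and an idealized traffic model in which every input element is read once. The function name `tall_skinny_roofline` and the hardware figures (7800 GFlop/s peak, 900 GB/s memory bandwidth) are illustrative assumptions, not values taken from the text above.

```python
def tall_skinny_roofline(k, m, n, peak_gflops, bandwidth_gbs, bytes_per_element=8):
    """Return (arithmetic intensity in flop/byte, roofline bound in GFlop/s).

    Assumed operation: C = A^T B with A of shape (k, m), B of shape (k, n),
    and k much larger than m and n. Each input element is read exactly once
    and C is written once (an idealized traffic model).
    """
    flops = 2.0 * k * m * n                                # one multiply and one add per term
    traffic = bytes_per_element * (k * m + k * n + m * n)  # read A and B, write C
    intensity = flops / traffic
    # Roofline: performance is capped by either compute peak or memory traffic.
    bound = min(peak_gflops, intensity * bandwidth_gbs)
    return intensity, bound


if __name__ == "__main__":
    # Illustrative hardware figures (assumptions, not measurements from the paper).
    PEAK_GFLOPS = 7800.0
    BANDWIDTH_GBS = 900.0
    K = 10**7  # hypothetical long dimension

    for width in (1, 4, 16, 64, 128):
        intensity, bound = tall_skinny_roofline(
            K, width, width, PEAK_GFLOPS, BANDWIDTH_GBS
        )
        limiter = "memory" if bound < PEAK_GFLOPS else "compute"
        print(
            f"M=N={width:4d}  intensity={intensity:6.2f} flop/B  "
            f"bound={bound:8.1f} GFlop/s  ({limiter}-bound)"
        )
```

Under these assumptions, narrow matrices have a low arithmetic intensity, so the attainable performance is set by memory bandwidth and not by the compute peak. The bound rises with the matrix width until it reaches the peak.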
