Performance engineering for a tall & skinny matrix multiplication kernel on GPUs

Conference Paper (2020)
Author(s)

Dominik Ernst (Friedrich-Alexander-Universität Erlangen-Nürnberg)

Georg Hager (Friedrich-Alexander-Universität Erlangen-Nürnberg)

Jonas Thies (Deutsches Zentrum für Luft- und Raumfahrt (DLR))

Gerhard Wellein (Friedrich-Alexander-Universität Erlangen-Nürnberg)

Affiliation
External organisation
DOI related publication
https://doi.org/10.1007/978-3-030-43229-4_43 Final published version
Publication Year
2020
Language
English
Pages (from-to)
505-515
ISBN (print)
9783030432287
Event
13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019 (2019-09-08 - 2019-09-11), Bialystok, Poland

Abstract

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. Nvidia’s current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance, i.e., reach the roofline limit. We further evaluate different approaches to parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach combined with autotuning enables an implementation that is both flexible and specialized. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.
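The roofline bound mentioned in the abstract can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text above: the product is taken to have the form C = A^T B with a tall & skinny A (k x m) and B (k x n), entries are 8-byte doubles, and the small m x n result stays on chip so that only A and B cause memory traffic. The function name and the hardware numbers in the usage lines are hypothetical placeholders, not measured values for any particular GPU.

```python
def roofline_gflops(m, n, peak_gflops, bandwidth_gbs, bytes_per_element=8):
    """Roofline upper bound (GFlop/s) for C = A^T B with k >> m, n.

    Assumption: A is k x m and B is k x n, both streamed once from memory;
    the m x n result is kept on chip and its traffic is neglected.
    """
    # Per row of A and B: 2*m*n flops, (m + n) elements loaded.
    # The long dimension k cancels out of the ratio.
    intensity = 2.0 * m * n / (bytes_per_element * (m + n))  # flop/byte
    # The roofline model takes the lower of the compute and memory limits.
    return min(peak_gflops, intensity * bandwidth_gbs)


# Placeholder hardware limits (assumed for illustration only).
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 800.0

for width in (4, 64, 128):
    bound = roofline_gflops(width, width, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"m = n = {width:3d}: bound = {bound:.0f} GFlop/s")
```

Under these assumptions the arithmetic intensity for m = n grows linearly with the matrix width, so narrow matrices are limited by memory bandwidth and wider ones eventually by peak compute.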