
Accelerating SYMV kernel on GPUs




Presentation Transcript


  1. Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST ahmad.ahmad@kaust.edu.sa

  2. Agenda • Motivation • GPU Technology • GPU Optimization issues • MAGMA SYMV kernel • The new SYMV Kernel • Performance Results • What Helped us? • Future Work

  3. Motivation • GPUs are invading the HPC community. • Many cores (~512) on a single GPU card. • Best suited for massively (embarrassingly) parallel problems. • Unlike CPUs, GPUs dedicate more silicon to floating-point operations. • Unlike CPUs, they consume much less power. • Three of the top 5 supercomputers are heterogeneous (CPUs + GPUs). • The world’s biggest supercomputer to be built will have 18,000 GPUs. • Getting high performance out of them, however, is quite a challenge.

  4. GPU Technology (Fermi) [Diagram: several SMs sharing an L2 cache and DRAM.]

  5. GPU Technology (Fermi) • For each SM: • 32 cores • 64 KB L1 cache / shared memory • 16 LD/ST units • 4 SFUs • 32768 registers (32-bit)

  6. GPU Technology (Fermi) • Fermi GPUs are the first GPUs with a complete memory hierarchy (registers, L1 cache/shared memory, L2 cache, DRAM). • Fermi is the first GPU with ECC support. • Fermi theoretical peak performance: • 1 Tflop/s (single precision) • ~500 Gflop/s (double precision)
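As a rough sanity check of these peak numbers (using the Tesla C2070 from the experiments slide; the core count and clock are our addition, not from the talk): 448 CUDA cores × 1.15 GHz × 2 flops/cycle (fused multiply-add) ≈ 1.03 Tflop/s single precision, and Fermi runs double precision at half rate, ≈ 515 Gflop/s.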

  7. GPU Technology • Why is it tough? Let’s take a look at the programming model… • A user program is designed as a grid of computation blocks. • Each block occupies one SM and has dedicated local memory. • Blocks share the L2 cache and global memory.

  8. GPU Technology • Why is it tough? Let’s take a look at the programming model… • A single computation block is divided into threads in 1D, 2D, or 3D arrays, commonly known as a thread block. • Threads are executed in warps (groups of 32). A minimal example follows.
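A minimal CUDA sketch of this model (names and sizes are illustrative, not from the talk): a grid of thread blocks is launched; each block is scheduled on an SM, and its threads run in warps of 32.

    #include <cuda_runtime.h>

    // Each thread scales one element; the grid covers the whole array.
    __global__ void scale(float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide index
        if (i < n)
            a[i] *= s;
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_a;
        cudaMalloc(&d_a, n * sizeof(float));

        dim3 threads(256);                              // 8 warps per block
        dim3 blocks((n + threads.x - 1) / threads.x);   // enough blocks for n
        scale<<<blocks, threads>>>(d_a, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_a);
        return 0;
    }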

  9. GPU Optimization Issues • General • Load balancing between computation blocks. • Data caching for reused data. • Data prefetching (to mask memory latency). • Avoid going to SLOW global memory as much as possible. • Coalesced memory access (per warp). • GPU specific • Avoid shared memory bank conflicts. • Avoid divergent branches (within the same warp). • Avoid using many registers per thread (at most 63 in Fermi). • Wisely use SM resources to increase occupancy (one SM can host more than one computation block simultaneously). Two of these points are sketched below.
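A sketch of coalesced access and bank-conflict avoidance using the classic tiled transpose (our illustration, not from the talk; launch with a 32×32 thread block):

    #define TILE 32

    __global__ void transpose_tile(const float *in, float *out, int n)
    {
        // The +1 pad shifts each row by one bank, so the column reads
        // below touch 32 different banks instead of conflicting on one.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Coalesced: consecutive threads of a warp (threadIdx.x) read
        // consecutive addresses within a row.
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;

        // The store is also coalesced; the transposition happened in SHMEM.
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
    }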

  10. The SYMV Kernel • A level-2 BLAS kernel. • Computes: y = α·A·x + β·y • A is a symmetric matrix (S/D/C/Z precisions). • x and y are vectors; α and β are scalars. • Only the lower or upper triangle of A should be referenced. • The matrix-vector multiplication involves data reuse in the vector x only. • No data reuse can be exploited for the elements of A (except through symmetry). A reference version follows.
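A plain CPU reference of the operation (our sketch; column-major storage, lower triangle referenced). It makes the reuse point concrete: x is reused across rows, while each stored element of A is loaded once and used twice through symmetry.

    // y = alpha*A*x + beta*y, A symmetric, only the lower triangle referenced.
    void symv_lower_ref(int n, float alpha, const float *A, int lda,
                        const float *x, float beta, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] *= beta;

        for (int j = 0; j < n; j++) {
            y[j] += alpha * A[j + j * lda] * x[j];       // diagonal element
            for (int i = j + 1; i < n; i++) {
                y[i] += alpha * A[i + j * lda] * x[j];   // A(i,j)
                y[j] += alpha * A[i + j * lda] * x[i];   // its mirror A(j,i)
            }
        }
    }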

  11. MAGMA SYMV Kernel (SC’11 paper) • Main ideas • The matrix is divided into 64×64 sub-matrices. • Each computation block is responsible for one horizontal row of sub-matrices. • A computation block starts with the diagonal sub-matrix of its assigned row. • Off-diagonal sub-matrices are visited twice: • once as the non-transposed sub-matrix; • once as the transposed sub-matrix, to exploit symmetry. • Recursive blocking • Used to save shared memory. • Each sub-matrix is processed in 32×32 chunks. • Pointer redirecting • Used to handle matrix dimensions that are not multiples of 64. A block-level paraphrase follows.
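A block-level CPU paraphrase of this traversal (our reconstruction from the slide, not MAGMA source; assumes n is a multiple of NB and y already scaled by beta):

    #define NB 64

    void symv_blocked_ref(int n, float alpha, const float *A, int lda,
                          const float *x, float *y)
    {
        for (int br = 0; br < n; br += NB) {     // one computation block per br
            // Diagonal sub-matrix first (lower triangle of the 64x64 tile).
            for (int j = br; j < br + NB; j++)
                for (int i = j; i < br + NB; i++) {
                    y[i] += alpha * A[i + j * lda] * x[j];
                    if (i != j)
                        y[j] += alpha * A[i + j * lda] * x[i];
                }
            // Off-diagonal sub-matrices, each visited twice.
            for (int bc = 0; bc < br; bc += NB)
                for (int j = bc; j < bc + NB; j++)
                    for (int i = br; i < br + NB; i++) {
                        y[i] += alpha * A[i + j * lda] * x[j];  // non-transposed
                        y[j] += alpha * A[i + j * lda] * x[i];  // transposed
                    }
        }
    }

On the GPU, the cross-row updates to y (the transposed contributions) are what the kernel spills to global memory and reduces there, as the next slide's diagram shows.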

  12. MAGMA SYMV Kernel [Diagram: within a computation block, partial sums are reduced through shared memory/registers; contributions belonging to other blocks are spilled to global memory and reduced there by the blocks that compute them.]

  13. Main Ideas of our Design • Same 64×64 block size as MAGMA. • Diagonal blocks are isolated from off-diagonal ones. • Each computation block is responsible for one vertical column of sub-matrices, offering better locality for the column-major format. • No recursive blocking • Fermi has enough shared memory (up to 48 KB per SM). • Allows more efficient data prefetching (in diagonal sub-matrices). • Shared memory usage is restricted to the reduction operation only • In Fermi, shared memory latency is high (compared to previous GPUs). • In MAGMA, shared memory is used for the reduction as well as for storing partial results. • In the new design, partial results are accumulated in registers first and spilled once to shared memory for the reduction, as sketched below.
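A minimal sketch of that last point, register accumulation with a single spill to shared memory (our illustration, not the released kernel; symmetry and diagonal/off-diagonal blocking are omitted, and n is assumed to be a multiple of 64). Launch with blockDim = (64, 4):

    #define NB 64
    #define NT 4                               // threads cooperating per row

    __global__ void symv_reg_accum_sketch(int n, float alpha,
                                          const float *A, int lda,
                                          const float *x, float *y)
    {
        __shared__ float red[NT][NB];          // SHMEM used for reduction only
        int tx  = threadIdx.x;                 // output row within the block
        int ty  = threadIdx.y;                 // column slice for this thread
        int row = blockIdx.x * NB + tx;

        float sum = 0.0f;                      // register accumulator
        for (int j = ty; j < n; j += NT)       // strided sweep over columns
            sum += A[row + j * lda] * x[j];

        red[ty][tx] = sum;                     // the single spill to SHMEM
        __syncthreads();

        if (ty == 0) {                         // reduce the NT partial sums
            float s = 0.0f;
            for (int k = 0; k < NT; k++)
                s += red[k][tx];
            y[row] += alpha * s;               // beta assumed pre-applied
        }
    }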

  14. The new SYMV kernel [Diagram: same reduction structure as slide 12, but each computation block owns a column of sub-matrices: in-block reduction through shared memory/registers, cross-block contributions spilled to and reduced through global memory.]

  15. Experiments • The new kernel • was written in CUDA C (version 4.0); • was integrated into MAGMA BLAS for testing; • is, so far, designed for matrix dimensions that are multiples of 64: we plan to use either pointer redirecting (as in MAGMA) or padding (easier to implement for a fast release; sketched below); • was tested on a Fermi (Tesla C2070) GPU with 6 GB of memory.
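The padding alternative, sketched (the helper below is hypothetical): round the dimension up to the next multiple of 64 and zero-fill the extra rows/columns, so the 64×64-tiled kernel runs unchanged.

    #include <cuda_runtime.h>

    static int pad_to_64(int n) { return (n + 63) & ~63; }

    /* usage sketch:
       int ldp = pad_to_64(n);
       cudaMalloc(&dA, (size_t)ldp * ldp * sizeof(float));
       cudaMemset(dA, 0, (size_t)ldp * ldp * sizeof(float)); // zero the padding
       // copy the n x n matrix into the ldp x ldp buffer, launch on ldp
    */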

  16. Performance Results

  17. Performance Results “cont.”

  18. What helped us? • PAPI CUDA component • Extracts performance counters during kernel execution. • Really easy to use (even for a first-time user)! • Mainly used to identify where improvements are possible: • shared memory bank conflicts • global memory misses (loads/stores) • divergent branches • local memory usage. A usage sketch follows.
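A hedged usage sketch of the PAPI C API with the CUDA component (the event name is an assumption; real names depend on the device and can be listed with papi_native_avail):

    #include <stdio.h>
    #include <papi.h>

    void count_kernel_event(void (*launch)(void))
    {
        int es = PAPI_NULL, code;
        long long value;
        char name[] = "cuda:::device:0:shared_load";  // illustrative name

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return;
        PAPI_create_eventset(&es);
        if (PAPI_event_name_to_code(name, &code) == PAPI_OK)
            PAPI_add_event(es, code);

        PAPI_start(es);
        launch();                     // run the kernel under measurement
        PAPI_stop(es, &value);
        printf("event count: %lld\n", value);
    }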

  19. What helped us? “cont.” • NVIDIA compute profiler • Extracts information that is unavailable or hard to get through the PAPI CUDA component: • registers per thread • GPU time • occupancy analysis • kernel memory bandwidth

  20. Future Work • The distribution of work among computation blocks is not balanced: sub-matrix columns of the triangle have different lengths. • Balancing the load may lead to further improvement, but locality would no longer be exploited. • A 1D block-cyclic assignment is intended, as sketched below. [Diagram: block-cyclic mapping of sub-matrix columns onto computation blocks 0–4.]
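A sketch of that 1D block-cyclic mapping (our illustration): computation block b no longer owns exactly one column of sub-matrices but every gridDim.x-th column, interleaving long and short columns of the triangle.

    __global__ void symv_block_cyclic_sketch(int nb /* = n/64 columns */)
    {
        // Block b processes columns b, b + gridDim.x, b + 2*gridDim.x, ...
        for (int col = blockIdx.x; col < nb; col += gridDim.x) {
            // process sub-matrix column `col`: diagonal block first, then
            // the off-diagonal blocks below it (omitted in this sketch)
        }
    }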

  21. Credits • Rajib Nath (University of California, San Diego) • Fruitful discussions about the design of the MAGMA SYMV kernel. • Guidelines for possible improvements. • Heike Jagode (UTK) • Guidelines for the installation and usage of PAPI.

  22. Thank You. Questions?
