
Presentation Transcript


  1. The Future of Numerical Linear Algebra: Automatic Performance Tuning of Sparse Matrix Codes; The Next LAPACK and ScaLAPACK. www.cs.berkeley.edu/~demmel/Utah_Apr05.ppt. James Demmel, UC Berkeley

  2. Outline • Automatic Performance Tuning of Sparse Matrix Kernels • Updating LAPACK and ScaLAPACK

  3. Outline • Automatic Performance Tuning of Sparse Matrix Kernels • Updating LAPACK and ScaLAPACK

  4. Berkeley Benchmarking and OPtimization (BeBOP) • Prof. Katherine Yelick • Rich Vuduc • Many results in this talk are from Vuduc’s PhD thesis, www.cs.berkeley.edu/~richie • Rajesh Nishtala, Mark Hoemmen, Hormozd Gahvari • Eun-Jim Im, many other earlier contributors • Other papers at bebop.cs.berkeley.edu

  5. Motivation for Automatic Performance Tuning • Writing high performance software is hard • Make programming easier while getting high speed • Ideal: program in your favorite high level language (Matlab, PETSc…) and get a high fraction of peak performance • Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… • Best choice can depend on knowing a lot of applied mathematics and computer science • How much of this can we teach?

  6. Motivation for Automatic Performance Tuning • Writing high performance software is hard • Make programming easier while getting high speed • Ideal: program in your favorite high level language (Matlab, PETSc…) and get a high fraction of peak performance • Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… • Best choice can depend on knowing a lot of applied mathematics and computer science • How much of this can we teach? • How much of this can we automate?

  7. Examples of Automatic Performance Tuning • Dense BLAS • Sequential • PHiPAC (UCB), then ATLAS (UTK) • Now in Matlab, many other releases • math-atlas.sourceforge.net/ • Fast Fourier Transform (FFT) & variations • FFTW (MIT) • Sequential and Parallel • 1999 Wilkinson Software Prize • www.fftw.org • Digital Signal Processing • SPIRAL: www.spiral.net (CMU) • MPI Collectives (UCB, UTK) • More projects, conferences, government reports, …

  8. Tuning Dense BLAS: PHiPAC

  9. Tuning Dense BLAS: ATLAS • Extends the applicability of PHiPAC • Incorporated in Matlab (along with the rest of LAPACK)

  10. How tuning works, so far • What do dense BLAS, FFTs, signal processing, and MPI reductions have in common? • Tuning can be done off-line: once per architecture and algorithm • It can take as much time as necessary (hours, a week…) • At run-time, the algorithm choice may depend on only a few parameters • Matrix dimension, size of FFT, etc.

  11. Register Tile Size Selection in Dense Matrix Multiply • [Figure: an m x k matrix times a k x n matrix, blocked into m0 x k0 and k0 x n0 register tiles producing m0 x n0 blocks of the result]
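
To make the picture concrete, here is a minimal sketch of a register-tiled matrix multiply C += A*B (not PHiPAC's generated code). The tile sizes M0, N0, K0 are exactly the parameters the tuners search over; the sketch assumes row-major storage and, for simplicity, that the tile sizes divide m, n, k evenly.

    /* Register-tiled C += A*B; a sketch only. The M0 x N0 tile of C stays in
       a small local array (kept in registers by the compiler) for the whole
       k loop, so each C element is loaded and stored once per (i, j) tile. */
    enum { M0 = 2, N0 = 3, K0 = 2 };   /* candidate register tile sizes */

    void matmul_tiled(int m, int n, int k,
                      const double *A,   /* m x k, row-major */
                      const double *B,   /* k x n, row-major */
                      double *C)         /* m x n, row-major */
    {
        for (int i = 0; i < m; i += M0)
            for (int j = 0; j < n; j += N0) {
                double c[M0][N0] = {{0.0}};          /* C tile accumulator */
                for (int p = 0; p < k; p += K0)
                    for (int pp = p; pp < p + K0; pp++)
                        for (int ii = 0; ii < M0; ii++)
                            for (int jj = 0; jj < N0; jj++)
                                c[ii][jj] += A[(i+ii)*k + pp] * B[pp*n + (j+jj)];
                for (int ii = 0; ii < M0; ii++)
                    for (int jj = 0; jj < N0; jj++)
                        C[(i+ii)*n + (j+jj)] += c[ii][jj];
            }
    }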

  12. Tuning Register Tile Sizes (Dense Matrix Multiply) • 333 MHz Sun Ultra 2i • [Figure: a 2-D slice of the 3-D tile-size space; implementations color-coded by performance in Mflop/s] • Only 16 registers, yet a 2-by-3 tile size is fastest • A needle in a haystack

  13. • 90% of implementations perform at < 33% of peak • 99% of implementations perform at < 66% of peak • Only 0.4% of implementations perform at >= 80% of peak
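
The off-line part of such tuners amounts to a brute-force timing loop over this space. A minimal sketch of that idea follows; benchmark_matmul is a stand-in (not a PHiPAC or ATLAS interface) for generating and timing one m0 x n0 x k0 kernel.

    #include <stdio.h>

    /* Stand-in for timing one generated m0 x n0 x k0 kernel; a real tuner
       would compile the kernel, run it on large matrices, and report Mflop/s.
       The dummy value below only keeps this sketch self-contained. */
    static double benchmark_matmul(int m0, int n0, int k0)
    {
        return (double)(m0 + n0 + k0);   /* placeholder, not a real measurement */
    }

    int main(void)
    {
        int best_m0 = 1, best_n0 = 1, best_k0 = 1;
        double best_mflops = 0.0;

        /* Exhaustive off-line search over the 3-D space of register tile sizes;
           run once per machine, and the winner is saved for use at run time. */
        for (int m0 = 1; m0 <= 8; m0++)
            for (int n0 = 1; n0 <= 8; n0++)
                for (int k0 = 1; k0 <= 8; k0++) {
                    double mflops = benchmark_matmul(m0, n0, k0);
                    if (mflops > best_mflops) {
                        best_mflops = mflops;
                        best_m0 = m0; best_n0 = n0; best_k0 = k0;
                    }
                }

        printf("best register tile: %d x %d x %d (%.1f Mflop/s)\n",
               best_m0, best_n0, best_k0, best_mflops);
        return 0;
    }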

  14. Limits of off-line tuning • Algorithm and its implementation may strongly depend on data only known at run-time • Ex: Sparse matrix nonzero pattern determines both best data structure and implementation of Sparse-matrix-vector-multiplication (SpMV) • Can’t afford to generate and test thousands of algorithms and implementations at run-time! • BEBOP project addresses sparse tuning

  15. A Sparse Matrix You Use Every Day

  16. Motivation for Automatic Performance Tuning of SpMV • SpMV widely used in practice • Kernel of iterative solvers for • linear systems • eigenvalue problems • Singular value problems • Historical trends • Sparse matrix-vector multiply (SpMV): 10% of peak or less • 2x faster than CSR with “hand-tuning” • Tuning becoming more difficult over time

  17. SpMV Historical Trends: Fraction of Peak

  18. Approach to Automatic Performance Tuning of SpMV • Our approach: empirical modeling and search • Off-line: measure performance of variety of data structures and SpMV algorithms • On-line: sample matrix, use performance model to predict which data structure/algorithm is best • Results • Up to 4x speedups and 31% of peak for SpMV • Using register blocking • Many other optimization techniques for SpMV

  19. SpMV with Compressed Sparse Row (CSR) Storage • Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j) • for each row i: for k = ptr[i] to ptr[i+1]-1 do: y[i] = y[i] + val[k]*x[ind[k]]
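
For reference, the same CSR kernel written out in C (a minimal sketch; the array names val, ind, ptr match the pseudocode above):

    /* CSR sparse matrix-vector multiply y += A*x. */
    void spmv_csr(int n,                  /* number of rows */
                  const int *ptr,         /* row pointers, length n+1 */
                  const int *ind,         /* column indices of nonzeros */
                  const double *val,      /* nonzero values */
                  const double *x,        /* input vector */
                  double *y)              /* output vector, updated in place */
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                yi += val[k] * x[ind[k]];
            y[i] = yi;
        }
    }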

  20. Example 1: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

  21. Example 1: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure

  22. Taking advantage of block structure in SpMV • Bottleneck is time to get matrix from memory • Only 2 flops for each nonzero in matrix • Goal: decrease size of data structure • Don’t store each nonzero with index, instead store each nonzero r-by-c block with one index • Storage drops by up to 2x (if rc >> 1, all 32-bit quantities) • Time to fetch matrix from memory decreases • Change both data structure and algorithm • Need to pick r and c • Need to change algorithm accordingly • In example, is r=c=8 best choice? • Minimizes storage, so looks like a good idea…
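
As an illustration of the blocked data structure (a sketch, not the actual SPARSITY/OSKI kernel), a 2x2 register-blocked SpMV stores one column index per block and fully unrolls the small block multiply. It assumes the matrix dimensions have been padded to multiples of the block size.

    /* 2x2 register-blocked (BCSR) SpMV, y += A*x. One column index per
       block; each block's 4 values are stored contiguously, row-major. */
    void spmv_bcsr_2x2(int nb_rows,        /* number of block rows (n/2) */
                       const int *bptr,    /* block-row pointers, length nb_rows+1 */
                       const int *bind,    /* starting column of each block */
                       const double *bval, /* block values, 4 per block */
                       const double *x,
                       double *y)
    {
        for (int I = 0; I < nb_rows; I++) {
            double y0 = y[2*I], y1 = y[2*I + 1];
            for (int k = bptr[I]; k < bptr[I+1]; k++) {
                const double *b = &bval[4*k];
                double x0 = x[bind[k]], x1 = x[bind[k] + 1];
                /* fully unrolled 2x2 block multiply */
                y0 += b[0]*x0 + b[1]*x1;
                y1 += b[2]*x0 + b[3]*x1;
            }
            y[2*I]     = y0;
            y[2*I + 1] = y1;
        }
    }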

  23. Speedups on Itanium 2: The Need for Search • [Figure: performance in Mflop/s for every r x c block size, with the reference (unblocked CSR) and best implementations marked] • Best block size: 4x2, not the 8x8 suggested by the dense substructure

  24. SpMV Performance (Matrix #2): Generation 1 • [Figure: best register-blocked vs. reference CSR performance on four machines; percentages are fractions of machine peak] • Power3: 13% (195 vs. 100 Mflop/s) • Power4: 14% (703 vs. 469 Mflop/s) • Itanium 1: 7% (225 vs. 103 Mflop/s) • Itanium 2: 31% (1.1 Gflop/s vs. 276 Mflop/s)

  25. Register Profile: Itanium 2 • [Figure: performance of all register block sizes; scale from 190 to 1190 Mflop/s]

  26. Register Profiles: IBM and Intel IA-64 • [Figure: register-blocking profiles on four machines; percentages are fractions of machine peak] • Power3: 17% (122 to 252 Mflop/s) • Power4: 16% (459 to 820 Mflop/s) • Itanium 1: 8% (107 to 247 Mflop/s) • Itanium 2: 33% (190 Mflop/s to 1.2 Gflop/s)

  27. Register Profiles: Sun and Intel x86 • [Figure: register-blocking profiles on four machines; percentages are fractions of machine peak] • Ultra 2i: 11% (35 to 72 Mflop/s) • Ultra 3: 5% (50 to 90 Mflop/s) • Pentium III: 21% (42 to 108 Mflop/s) • Pentium III-M: 15% (58 to 122 Mflop/s)

  28. Example 2: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem

  29. Zoom in to top corner • More complicated non-zero structure in general

  30. 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells

  31. 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • But would lead to lots of “fill-in”: 1.5x

  32. Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking • Logical grid of 3x3 cells • Fill in explicit zeros • Unroll 3x3 block multiplies • “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! • Actual Mflop rate is 1.5² = 2.25x higher
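
Spelling out that last bullet, with 2·nnz useful flops and the raw flop count inflated by the fill ratio:

    % Speedup is measured on useful flops; the raw Mflop rate also counts the
    % flops performed on the filled-in explicit zeros.
    \[
    \frac{P_{\text{raw, blocked}}}{P_{\text{CSR}}}
      = \frac{(\text{fill ratio}) \cdot 2\,\mathrm{nnz} \,/\, t_{\text{blocked}}}
             {2\,\mathrm{nnz} \,/\, t_{\text{CSR}}}
      = (\text{fill ratio}) \times \frac{t_{\text{CSR}}}{t_{\text{blocked}}}
      = 1.5 \times 1.5 = 2.25 .
    \]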

  33. Automatic Register Block Size Selection • Selecting the r x c block size • Off-line benchmark • Precompute Mflops(r,c) using a dense matrix stored in sparse r x c blocked format, for each r x c • Once per machine/architecture • Run-time “search” • Sample A to estimate Fill(r,c) for each r x c • Run-time heuristic model • Choose r, c to minimize estimated time ~ Fill(r,c) / Mflops(r,c)
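
In code, the heuristic is just a small minimization over the candidate block sizes. A sketch, assuming fill[] holds the sampled Fill(r,c) estimates and mflops[] the off-line benchmark results, both laid out row-major with r and c starting at 1:

    /* Pick (r, c) minimizing estimated time ~ Fill(r,c) / Mflops(r,c). */
    void choose_block_size(int rmax, int cmax,
                           const double *fill,    /* fill[(r-1)*cmax + (c-1)] */
                           const double *mflops,  /* same layout */
                           int *best_r, int *best_c)
    {
        double best_time = -1.0;
        for (int r = 1; r <= rmax; r++)
            for (int c = 1; c <= cmax; c++) {
                double t = fill[(r-1)*cmax + (c-1)] / mflops[(r-1)*cmax + (c-1)];
                if (best_time < 0.0 || t < best_time) {
                    best_time = t;
                    *best_r = r;
                    *best_c = c;
                }
            }
    }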

  34. Accurate and Efficient Adaptive Fill Estimation • Idea: Sample the matrix • Fraction of matrix to sample: s ∈ [0,1] • Cost ~ O(s * nnz) • Control cost by controlling s • Search at run-time: the constant matters! • Control s automatically by computing statistical confidence intervals • Idea: Monitor variance • Cost of tuning • Heuristic: costs 1 to 11 unblocked SpMVs • Converting the matrix costs 5 to 40 unblocked SpMVs • Tuning is a good idea when doing many SpMVs
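
A sketch of the sampling idea for Fill(r,c) on a CSR matrix (the adaptive, confidence-interval control of s described above is omitted; the estimate simply visits a random fraction s of the block rows):

    #include <stdlib.h>

    /* Estimate Fill(r,c) = (blocks * r * c) / nnz over a sampled subset of
       block rows. Assumes a square n x n matrix for simplicity. */
    double estimate_fill(int n, const int *ptr, const int *ind,
                         int r, int c, double s)
    {
        int nbrows = (n + r - 1) / r;
        int nbcols = (n + c - 1) / c;
        int *seen = calloc(nbcols, sizeof(int)); /* last block row touching each block col */
        long long blocks = 0, nnz_sampled = 0;

        for (int I = 0; I < nbrows; I++) {
            if ((double)rand() / RAND_MAX >= s) continue;   /* keep block row with prob. s */
            for (int i = I * r; i < (I + 1) * r && i < n; i++)
                for (int k = ptr[i]; k < ptr[i+1]; k++) {
                    int J = ind[k] / c;
                    nnz_sampled++;
                    if (seen[J] != I + 1) {     /* first nonzero seen in block (I, J) */
                        seen[J] = I + 1;
                        blocks++;
                    }
                }
        }
        free(seen);
        if (nnz_sampled == 0) return 1.0;       /* nothing sampled: assume no fill */
        return (double)blocks * r * c / (double)nnz_sampled;
    }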

  35. Test Matrix Collection • Many on-line sources (see Vuduc’s thesis) • Matrix 1: dense (in sparse format) • Matrices 2-9: FEM with one block size r x c • N from 14K to 62K, NNZ from 1M to 3M • Fluid flow, structural mechanics, materials, … • Matrices 10-17: FEM with multiple block sizes • N from 17K to 52K, NNZ from 0.5M to 2.7M • Fluid flow, buckling, … • Matrices 18-37: “Other” • N from 5K to 75K, NNZ from 33K to 0.5M • Power grid, chemical engineering, finance, semiconductors, … • Matrices 40-44: Linear Programming • (N,M) from (3K,13K) to (15K,77K), NNZ from 50K to 2M

  36. Accuracy of the Tuning Heuristics (1/4) See p. 375 of Vuduc’s thesis for matrices NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)

  37. Accuracy of the Tuning Heuristics (2/4)

  38. Accuracy of the Tuning Heuristics (3/4) [figure includes DGEMV for reference]

  39. Evaluating algorithms and machines for SpMV • Some speedups look good, but could we do better? • Questions • What is the best speedup possible? • Independent of instruction scheduling and selection • Can SpMV be improved further, or not? • What machines are “good” for SpMV? • How can architectures be changed to improve SpMV?

  40. Upper Bounds on Performance for register blocked SpMV • P = (flops) / (time) • Flops = 2 * nnz(A) … don’t count extra work on explicit zeros • Lower bound on time: two main assumptions • 1. Count memory operations only (streaming) • 2. Count only compulsory and capacity misses: ignore conflicts • Account for line sizes • Account for matrix size and nnz • Charge the minimum access “latency” αi at each cache level Li, and αmem at main memory • Latencies measured with, e.g., the Saavedra-Barrera and PMaC MAPS benchmarks
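
Written out, the latency-based bound behind these plots looks roughly as follows (a sketch consistent with the assumptions above; Hi is the number of loads that hit at cache level Li, and M the number that go to memory):

    % Every load is charged at least the minimum latency of the level that
    % services it; flops on explicit zeros are not counted as useful work.
    \[
    T_{\text{lower}} \;=\; \sum_{i} \alpha_i \, H_i \;+\; \alpha_{\mathrm{mem}} \, M ,
    \qquad
    P_{\text{upper}} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\text{lower}}} .
    \]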

  41. Example: L2 Misses on Itanium 2 Misses measured using PAPI [Browne ’00]

  42. Example: Bounds on Itanium 2

  43. Example: Bounds on Itanium 2

  44. Example: Bounds on Itanium 2

  45. Fraction of Upper Bound Across Platforms

  46. Summary of Other Performance Optimizations • Optimizations for SpMV • Register blocking (RB): up to 4x over CSR • Variable block splitting: 2.1x over CSR, 1.8x over RB • Diagonals: 2x over CSR • Reordering to create dense structure + splitting: 2x over CSR • Symmetry: 2.8x over CSR, 2.6x over RB • Cache blocking: 2.8x over CSR • Multiple vectors (SpMM): 7x over CSR • And combinations… • Sparse triangular solve • Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels • AAᵀ*x, AᵀA*x: 4x over CSR, 1.8x over RB • A²*x: 2x over CSR, 1.5x over RB

  47. Example: Sparse Triangular Factor • Raefsky4 (structural problem) + SuperLU + colmmd • N = 19779, nnz = 12.6 M • Dense trailing triangle: dim = 2268, 20% of total nonzeros (can be as high as 90+%!) • 1.8x speedup over CSR
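
The hybrid data structure amounts to splitting the solve into a sparse stage and a dense stage. A minimal sketch under the assumptions noted in the comments (not the tuned implementation):

    /* Hybrid sparse/dense lower triangular solve L*x = b: the leading n1 x n1
       block of L is sparse, the trailing n2 x n2 triangle is dense.
       Assumptions: CSR rows 0..n1-1 store L11 with the diagonal last in each
       row; CSR rows n1..n1+n2-1 store only the columns < n1 (the L21 part);
       Ld is the dense trailing triangle, row-major. In practice the last
       stage calls the tuned dense BLAS DTRSV; a plain loop is used here. */
    void sptrsv_hybrid(int n1, int n2,
                       const int *ptr, const int *ind, const double *val,
                       const double *Ld, const double *b, double *x)
    {
        /* 1. sparse forward solve: L11 * x1 = b1 */
        for (int i = 0; i < n1; i++) {
            double t = b[i];
            int kdiag = ptr[i+1] - 1;              /* diagonal assumed last */
            for (int k = ptr[i]; k < kdiag; k++)
                t -= val[k] * x[ind[k]];
            x[i] = t / val[kdiag];
        }
        /* 2. sparse update: r2 = b2 - L21 * x1 */
        for (int i = 0; i < n2; i++) {
            double t = b[n1 + i];
            for (int k = ptr[n1 + i]; k < ptr[n1 + i + 1]; k++)
                t -= val[k] * x[ind[k]];           /* ind[k] < n1 here */
            x[n1 + i] = t;
        }
        /* 3. dense forward solve on the trailing triangle (DTRSV in practice) */
        for (int i = 0; i < n2; i++) {
            double t = x[n1 + i];
            for (int j = 0; j < i; j++)
                t -= Ld[i*n2 + j] * x[n1 + j];
            x[n1 + i] = t / Ld[i*n2 + i];
        }
    }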

  48. Cache Optimizations for AAᵀ*x • Cache-level: interleave the multiplications by Aᵀ and A, so each sparse row is read once and used for both a dot product and an “axpy” • Register-level: treat aᵢᵀ as an r×c block row, or a diagonal row • Algorithmic-level transformations for A²*x, A³*x, …
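
A sketch of the cache-level interleaving, shown here for the related kernel y = AᵀA*x, which is the natural row-wise variant for CSR storage (AAᵀ*x is analogous with the roles of rows and columns exchanged, e.g., for CSC). Each sparse row is brought into cache once and used for both the dot product and the “axpy”.

    #include <string.h>

    /* y = (A^T A) x for an m x n CSR matrix A; y has length n. */
    void ata_x_csr(int m, int n,
                   const int *ptr, const int *ind, const double *val,
                   const double *x,   /* length n */
                   double *y)         /* length n, overwritten */
    {
        memset(y, 0, n * sizeof(double));
        for (int i = 0; i < m; i++) {
            /* dot product: t = a_i^T x */
            double t = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                t += val[k] * x[ind[k]];
            /* axpy: y += t * a_i, reusing the row while it is still in cache */
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                y[ind[k]] += t * val[k];
        }
    }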

  49. Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques • Symmetric storage • Register blocking • Reordering rows and columns to improve blocking • Reverse Cuthill-McKee ordering to reduce bandwidth • Traveling Salesman Problem-based ordering to create blocks • [Pinar & Heath ’97] • Make columns adjacent if they have many common nonzero rows • 2.1x speedup on Power 4

  50. Source: Accelerator Cavity Design Problem (Ko via Husbands)
