
GPU Sparse LU Factorization and Its Application in Circuit Simulation


Presentation Transcript


  1. GPU Sparse LU Factorization and Its Application in Circuit Simulation Nano-scale Integrated Circuit and System Lab., EE Department, Tsinghua University Ling Ren

  2. Abstract • First work on GPU sparse LU factorization • Algorithm description: elimination graph (EGraph) • Algorithm analysis: parallelism in the left-looking algorithm • Algorithm implementation: timing order on GPU • Supplement to OpenCL BLAS • The current clAMDBLAS has a triangular solve but no LU factorization • Objective of LU: factor A = LU (L lower triangular, U upper triangular), so that Ax = b reduces to the triangular solves Ly = b and Ux = y

  3. Outline • Background • Sparse LU factorization • Dense LU factorization • Summary

  4. Background • SPICE: the most popular circuit simulator • Simulating a VLSI circuit (~1 billion transistors) takes several days • Bottleneck: sparse LU factorization • The same kernel is a bottleneck in fluid dynamics, structural analysis, economics, ...

  5. Outline • Background • Sparse LU factorization • Dense LU factorization • Summary

  6. Sparse LU factorization – related works • [SuperLU 1999] • Sequential, multi-thread, and distributed versions • Incorporates supernodes, efficient for dense blocks • [Pardiso 2002] • Sequential, multi-thread, distributed, and GPU [Christen 2007] versions • Adopts supernodes • But supernodes rarely form in circuit matrices • [KLU 2010] • Optimized for circuit matrices • Sequential only, uses the G/P left-looking algorithm [G/P 1988] • Adopts BTF (block triangular form), without supernodes

  7. Sparse LU factorization – left-looking • Sequentially process each column • When processing column k, use all the columns on its left (1, 2, ..., k-1) to update column k • Update = vector multiply-and-add (MAD), sketched below • Memory accesses (read + write) outnumber arithmetic operations, so the update is memory-bound
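
To make the update concrete, here is a minimal CPU-side sketch of one vector MAD in the left-looking scheme, assuming column k is held in a dense working array x and L is stored in CSC arrays Lp/Li/Lx (names are illustrative, not the paper's):

    /* Update column k (held densely in x) with finished column j (j < k).
       ujk = x[j] is the value of U(j,k) after the earlier updates.
       Each nonzero costs one read, one multiply-add, one write:
       memory traffic dominates arithmetic. */
    void left_looking_update(int j, const int *Lp, const int *Li,
                             const double *Lx, double *x)
    {
        double ujk = x[j];
        for (int p = Lp[j]; p < Lp[j + 1]; p++)  /* nonzeros of L(:,j) */
            x[Li[p]] -= Lx[p] * ujk;             /* vector multiply-and-add */
    }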

  8. Algorithm description – EGraph • Every column is updated with several columns on its left (vector MAD) • The nonzero structure of U determines the dependencies: there is an edge from j to k iff U(j, k) is nonzero • (Figure: (a) upper triangular matrix U; (b) the corresponding EGraph)
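
Since an edge j -> k exists exactly when U(j, k) is nonzero, the dependency list of column k can be read straight out of U's CSC structure; a sketch with illustrative names:

    /* Collect the columns that must finish before column k can start.
       Up/Ui are the CSC column pointers and row indices of U. */
    int dependencies_of(int k, const int *Up, const int *Ui, int *deps)
    {
        int nd = 0;
        for (int p = Up[k]; p < Up[k + 1]; p++)
            if (Ui[p] < k)         /* U(j,k) != 0 with j < k: edge j -> k */
                deps[nd++] = Ui[p];
        return nd;
    }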

  9. Algorithm analysis – two kinds of parallelism • Divide columns into levels: columns in the same level are independent of each other • Cluster mode: many columns factorized in parallel (a host-side sketch follows) • Pipeline mode: overlap columns from different levels, along with a timing order • (Figure: overlapped factorization in pipeline mode; threads 1 and 2 interleave columns 1–4 from different levels)
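
Cluster mode maps naturally to one kernel launch per level, since columns inside a level have no edges between them. A host-side sketch, where launch_factor_kernel, level_ptr, and level_cols are hypothetical names for the scheduler's output:

    /* Factorize level by level; columns within a level run in parallel. */
    for (int lev = 0; lev < n_levels; lev++) {
        int first = level_ptr[lev];
        int count = level_ptr[lev + 1] - first;
        launch_factor_kernel(level_cols + first, count); /* one work-group per column */
        /* implicit barrier: the next level starts only after this one finishes */
    }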

  10. Sparse LU factorization – workflow

  11. Sparse LU factorization – preprocessing • Preprocessing: performed only once, on the CPU (a sketch follows) • MC64 to ensure numerical stability [MC64] • Approximate Minimum Degree ordering to reduce fill-ins [AMD] • Pre-factorization (numeric factorization with partial pivoting) to calculate the symbolic structure of L and U • Sorting the nonzeros of L and U (introduced later)
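
The four steps chain together as below; every helper name is a hypothetical wrapper around the cited package, not an actual API (the csc_matrix type is sketched on the next slide):

    /* One-time CPU preprocessing before any GPU work. */
    void preprocess(csc_matrix *A)
    {
        mc64_scale_and_permute(A);  /* [MC64]: move large entries to the diagonal */
        amd_reorder(A);             /* [AMD]: fill-reducing ordering              */
        prefactorize(A);            /* partial-pivoting pass: symbolic L and U    */
        sort_nonzeros_by_row(A);    /* locality optimization, see slide 15        */
    }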

  12. Sparse LU factorization – on GPU • GPU inputs • Locations and values of the nonzeros in A • Locations of the nonzeros in L and U • The EScheduler • GPU outputs • Values of the nonzeros in L and U • CSC (Compressed Sparse Column) format for the sparse matrices A, L, and U
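
For reference, a minimal CSC container matching the description above; field names are illustrative:

    /* CSC: column j's row indices are Ai[Ap[j] .. Ap[j+1]-1],
       with the matching values in the same range of Ax. */
    typedef struct {
        int     n;    /* matrix dimension          */
        int    *Ap;   /* n+1 column pointers       */
        int    *Ai;   /* row index of each nonzero */
        double *Ax;   /* value of each nonzero     */
    } csc_matrix;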

  13. Sparse LU factorization – avoiding deadlock • In traditional GPU programs, some wavefronts are inactive at the beginning (because of limited resources, etc.); they wait for the active wavefronts to finish and then become active • But in sparse LU, we must ensure that all wavefronts are active from the beginning: a wavefront spinning on a column owned by a not-yet-scheduled wavefront would deadlock (see the sketch below)
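
The reason is the timing order: a column's work-group busy-waits on a flag set by the producer of each dependency column. If that producer's work-group were not resident, the spin below would never terminate. A minimal OpenCL C sketch; col_finished and dep_col are illustrative names:

    /* Spin until dependency column dep_col is marked finished. */
    __kernel void wait_for_column(__global volatile int *col_finished,
                                  int dep_col)
    {
        if (get_local_id(0) == 0)
            while (atomic_add(&col_finished[dep_col], 0) == 0)
                ;   /* only terminates if the producer wavefront is active */
        barrier(CLK_GLOBAL_MEM_FENCE);  /* rest of the work-group waits here */
    }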

  14. Sparse LU factorization – data formats • Data formats for intermediate results: dense arrays vs. CSC • CSC (Compressed Sparse Column) • Can be put in local memory • Indexed accesses are inconvenient (they require a binary search) • Using too much local memory reduces the number of active work-groups, which leads to severe performance loss • Dense arrays outperform the CSC format: 2.5x
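
The dense-array variant works scatter/gather style: expand column k of A into a dense vector, apply all updates with direct indexing (no binary search), then compress the result back into the CSC arrays of L and U. A fragment with illustrative names:

    /* x has length n and is cleared between columns. */
    for (int p = Ap[k]; p < Ap[k + 1]; p++)   /* scatter A(:,k) */
        x[Ai[p]] = Ax[p];
    /* ... left-looking updates read and write x[i] directly ... */
    for (int p = Up[k]; p < Up[k + 1]; p++)   /* gather U(:,k)  */
        Ux[p] = x[Ui[p]];
    for (int p = Lp[k]; p < Lp[k + 1]; p++)   /* gather L(:,k)  */
        Lx[p] = x[Li[p]];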

  15. Sparse LU factorization – data locality • Global memory bandwidth is higher when consecutive work-items access consecutive addresses • To improve data locality: the nonzeros of L and U are out of order after preprocessing, so sort them according to their row indices • 1.7x speedup, with negligible overhead • Performed only once, incorporated into preprocessing
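
Because the sort runs once on the CPU, a plain qsort per column suffices; a sketch where entry and by_row are illustrative names:

    #include <stdlib.h>

    typedef struct { int row; double val; } entry;

    static int by_row(const void *a, const void *b)
    {
        return ((const entry *)a)->row - ((const entry *)b)->row;
    }

    /* Sort each column's nonzeros by row index so that consecutive
       work-items read consecutive addresses (coalesced access). */
    void sort_columns(int n, const int *Ap, entry *nz)
    {
        for (int j = 0; j < n; j++)
            qsort(nz + Ap[j], Ap[j + 1] - Ap[j], sizeof(entry), by_row);
    }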

  16. Experimental setup • CPU • 2 Xeon E5405 CPUs (8 cores in total) • 2 x 6 MB L2 cache, 16 GB RAM • GPU • AMD Radeon 5870 • Test matrices • University of Florida Sparse Matrix Collection [Davis] http://www.cise.ufl.edu/research/sparse/matrices/

  17. Sparse LU factorization – Experimental results • GPU speedups are positively related to the number of floating-point operations (flops)

  18. Sparse LU factorization – Experimental results • Matrices divided into 4 groups • The first three groups are divided according to Mflops • GPU speedup is positively related to Mflops • 4th group: dominated by denormal floating-point numbers • Denormals represent extremely small numbers • Handled very slowly on the CPU, but at full speed on the GPU • An advantage of GPUs in sparse LU and in scientific computing generally • Very high speedups for this group
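
For illustration (not from the slides): a denormal is what remains below DBL_MIN before a value flushes to zero. Many CPUs trap into microcode to process such operands, while the GPU used here runs them at full speed:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double d = DBL_MIN / 16.0;  /* below DBL_MIN: a denormal number */
        printf("DBL_MIN/16 = %g (%s)\n", d,
               d > 0.0 ? "denormal, still nonzero" : "flushed to zero");
        return 0;
    }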

  19. Sparse LU factorization – Experimental results • Average speedup of each group

  20. Scalability – BBD • Problem: how to use multiple GPUs? • Circuit-partition-based simulation algorithm • Bordered-block-diagonal (BBD) matrix form, sketched below • Diagonal blocks are factorized independently • But the last block An becomes dense, so we need dense LU factorization
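
For reference, the BBD form assumed here, with independent diagonal blocks A_1, ..., A_{n-1} from the circuit partitions and border blocks that couple them through the last block A_n:

    $$A = \begin{pmatrix}
      A_1 &        &         & B_1     \\
          & \ddots &         & \vdots  \\
          &        & A_{n-1} & B_{n-1} \\
      C_1 & \cdots & C_{n-1} & A_n
    \end{pmatrix}$$

Each A_i (i < n) factorizes independently, but the updates contributed by the border blocks accumulate into A_n, which is why A_n fills in and calls for a dense LU routine.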

  21. Outline • Background • Sparse LU factorization • Dense LU factorization • Summary

  22. Dense LU factorization – blocked algorithm • Three core operations • Dense LU factorization (of a block column) • Triangular matrix inversion • Matrix multiplication (GEMM) • Suitable for GPU • GEMM is the most frequent operation • GEMM is very efficient on GPU: 920 Gflop/s (single precision), 290 Gflop/s (double precision) • (Figure: finished part, LU + inverse of the current block, GEMM update of the trailing matrix)
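
A driver for the blocked algorithm is a short loop over block columns; the three helpers below are hypothetical stand-ins for the three core operations named above:

    /* Blocked LU of an n-by-n matrix with block size nb (sketch). */
    void blocked_lu(double *A, int n, int nb)
    {
        for (int k = 0; k < n; k += nb) {
            int b = (n - k < nb) ? n - k : nb;
            factor_block_column(A, n, k, b); /* dense LU of the k-th block column */
            invert_triangular(A, n, k, b);   /* triangular inversion, forms U12   */
            gemm_update(A, n, k, b);         /* trailing update A22 -= L21 * U12, */
        }                                    /* the GEMM that dominates runtime   */
    }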

  23. Dense LU factorization – performance • 443 Gflop/s (single precision), 163 Gflop/s (double precision)

  24. Dense LU factorization – related works • Comparison to previous studies [Galoppo 2005], [Volkov 2008], [Tomov 2010]

  25. Dense LU factorization – further improvement • CPU BLAS performs Gaussian elimination at about 100 Gflop/s • GEMM can be further improved • Scalability to multiple GPUs • Blocked dense LU: independent GEMMs on multiple GPUs • Diagonal blocks of the BBD form on multiple GPUs • Linear performance improvement expected

  26. Summary • First work on GPU sparse LU factorization • Exploits the parallelism of the left-looking algorithm • Blocked dense LU factorization • 443 Gflop/s (single precision), 163 Gflop/s (double precision) • Supplement to OpenCL BLAS • Accelerates SPICE simulators

  27. References • [SPICE] L. W. Nagel, "SPICE2: A computer program to simulate semiconductor circuits," Ph.D. dissertation, University of California, Berkeley, 1975. • [SuperLU 1999] J. W. Demmel, S. C. Eisenstat, J. R. Gilbert, X. S. Li, and J. W. H. Liu, "A supernodal approach to sparse partial pivoting," SIAM J. Matrix Analysis and Applications, vol. 20, no. 3, pp. 720–755, 1999. • [Pardiso 2002] O. Schenk and K. Gärtner, "Solving unsymmetric sparse systems of linear equations with PARDISO," Computational Science – ICCS 2002, vol. 2330, pp. 355–363, 2002. • [G/P 1988] J. R. Gilbert and T. Peierls, "Sparse partial pivoting in time proportional to arithmetic operations," SIAM J. Sci. Statist. Comput., vol. 9, pp. 862–874, 1988. • [KLU 2010] T. A. Davis and E. Palamadai Natarajan, "Algorithm 907: KLU, a direct sparse solver for circuit simulation problems," ACM Trans. Math. Softw., vol. 37, pp. 36:1–36:17, September 2010.

  28. References • [Christen 2007] M. Christen, O. Schenk, and H. Burkhart, "General-purpose sparse matrix building blocks using the NVIDIA CUDA technology platform," 2007. • [Davis] T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," to appear in ACM Transactions on Mathematical Software. • [Galoppo 2005] N. Galoppo, N. K. Govindaraju, M. Henson, and D. Manocha, "LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware," SC Conference, vol. 0, p. 3, 2005. • [Volkov 2008] V. Volkov and J. Demmel, "LU, QR and Cholesky factorizations using vector capabilities of GPUs," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2008-49, May 2008. • [Tomov 2010] S. Tomov, J. Dongarra, and M. Baboulin, "Towards dense linear algebra for hybrid GPU accelerated manycore systems," Parallel Comput., vol. 36, pp. 232–240, June 2010.

  29. References • [MC64] I. S. Duff and J. Koster, "The design and use of algorithms for permuting large entries to the diagonal of sparse matrices," SIAM J. Matrix Anal. and Applics., no. 4, pp. 889–901, 1997. • [AMD] P. R. Amestoy, T. A. Davis, and I. S. Duff, "Algorithm 837: AMD, an approximate minimum degree ordering algorithm," ACM Trans. Math. Softw., vol. 30, pp. 381–388, September 2004.

  30. Thank you! Nano-scale Integrated Circuit and System Lab., EE Department, Tsinghua University

  31. Sparse LU factorization – Terminology • Elimination graph (EGraph) definition • There is an edge from j to k iff U(j, k) != 0 • In the following, node = column • Level definition • The level of a node is the length of the longest path from any source node to it • Source nodes have no incoming edges
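
Because every EGraph edge points from a lower column index to a higher one, levels follow from a single sweep in column order; a sketch reusing the CSC arrays of U:

    /* level[k] = length of the longest path from a source to column k. */
    void compute_levels(int n, const int *Up, const int *Ui, int *level)
    {
        for (int k = 0; k < n; k++) {
            level[k] = 0;                    /* sources stay at level 0 */
            for (int p = Up[k]; p < Up[k + 1]; p++) {
                int j = Ui[p];               /* edge j -> k if j < k */
                if (j < k && level[j] + 1 > level[k])
                    level[k] = level[j] + 1;
            }
        }
    }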

  32. Sparse LU factorization – Experimental results

  33. Dense LU factorization – Basic algorithm • Partition $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$ • Factorize the first block column to get $L_{11}$, $L_{21}$ and $U_{11}$, then form $U_{12} = L_{11}^{-1} A_{12}$ and the trailing update $\tilde{A}_{22} = A_{22} - L_{21} U_{12}$ • Blocked LU factorization

  34. Dense LU factorization – Basic algorithm • Repeat the process on the trailing matrix $\tilde{A}_{22}$ to obtain $L_{22}$, $U_{22}$, and so on
