Introduction to Parallel Computing



Presentation Transcript


  1. Introduction to Parallel Computing Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar Chapter 8: Dense Matrix Algorithms

  2. Outline • Matrix-Vector Multiplication • Matrix-Matrix Multiplication • Solving a System of Linear Equations • Gaussian Elimination • Solving a Triangular System: Back-Substitution

  3. Review • Performance metrics • Work: W • Parallel time: T_P • Total overhead function: T_o = pT_P − W • Cost: pT_P; cost optimal if pT_P = Θ(W) • Speedup: S = W/T_P • Efficiency: E = S/p • Isoefficiency analysis: W = K·T_o(W, p), where K = E/(1 − E) • Get the relation W(p) so that the equation holds • Communication: hypercube network (startup time t_s, per-word transfer time t_w)

  4. Matrix-Vector Multiplication • Compute: y = A·x, where A is an n×n matrix and x is an n×1 vector • Serial complexity: W = O(n²)

  5. Rowwise 1-D Partitioning • Each of the p processes owns n/p rows of A and n/p elements of x • All-to-all broadcast of the vector elements: t_s log p + t_w n • Local computation: n²/p • Parallel time: T_P = n²/p + t_s log p + t_w n • Cost: pT_P = n² + t_s p log p + t_w np • Cost optimal if p = O(n)
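A minimal serial sketch of the rowwise 1-D algorithm above (not the book's code): the all-to-all broadcast is modeled by concatenating the n/p vector pieces, and each simulated process then multiplies its n/p rows by the full vector. The sizes n and p are illustrative.

```python
# Sketch: serial simulation of rowwise 1-D partitioned matrix-vector multiplication.
import numpy as np

n, p = 8, 4                            # illustrative sizes; n divisible by p
A = np.random.rand(n, n)
x = np.random.rand(n)

row_blocks = np.array_split(A, p)      # process i owns n/p consecutive rows of A
x_pieces = np.array_split(x, p)        # and n/p consecutive elements of x

x_full = np.concatenate(x_pieces)                          # all-to-all broadcast of the vector
y = np.concatenate([Ai @ x_full for Ai in row_blocks])     # local computations

assert np.allclose(y, A @ x)
```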

  6. Rowwise 1-D Partitioning • Scalability analysis: T_o = pT_P − W = t_s p log p + t_w np, and we require W = K·T_o • The equation holds (for the dominant t_w term) if n² = K t_w np, i.e., n = K t_w p • Because W = O(n²), this gives W = K² t_w² p² • Isoefficiency function is Θ(p²)
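A short derivation sketch of the isoefficiency relation stated above, using the overhead function of the 1-D algorithm; K = E/(1 − E) is the constant from the review slide.

```latex
% Sketch: isoefficiency of rowwise 1-D matrix-vector multiplication.
% Overhead: T_o = p T_P - W = t_s p \log p + t_w n p.
% Balance W = K T_o against the dominant t_w term:
\[
  n^2 = K t_w n p
  \;\Rightarrow\; n = K t_w p
  \;\Rightarrow\; W = n^2 = K^2 t_w^2 p^2 = \Theta(p^2).
\]
% The t_s term only requires W = \Theta(p \log p), so \Theta(p^2) dominates.
```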

  7. 2-D Partitioning • Initial data alignment of the vector along the diagonal processes • One-to-all broadcast of the n/√p vector elements along each column: (t_s + t_w n/√p) log √p • Local computation: n²/p • All-to-one reduction of the partial results along each row: (t_s + t_w n/√p) log √p • Parallel time: T_P ≈ n²/p + t_s log p + t_w (n/√p) log p • Cost: n² + t_s p log p + t_w n√p log p • Cost optimal if p = O(n²/log² n)
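A minimal serial sketch of the 2-D algorithm above: process (i, j) of a √p × √p grid holds the block A_ij; after the column broadcast it also has the vector segment x_j, computes a partial product, and the row-wise reduction sums the partials. Sizes are illustrative.

```python
# Sketch: serial simulation of 2-D partitioned matrix-vector multiplication.
import numpy as np

n, q = 8, 2                            # q = sqrt(p); illustrative sizes
A = np.random.rand(n, n)
x = np.random.rand(n)
b = n // q                             # block dimension

partial = np.zeros((q, q, b))
for i in range(q):
    for j in range(q):
        A_ij = A[i*b:(i+1)*b, j*b:(j+1)*b]    # block owned by process (i, j)
        x_j = x[j*b:(j+1)*b]                  # segment delivered by the column broadcast
        partial[i, j] = A_ij @ x_j            # local computation

y = partial.sum(axis=1).reshape(n)            # all-to-one reduction along each row
assert np.allclose(y, A @ x)
```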

  8. 2-D Partitioning • Scalability analysis: T_o = t_s p log p + t_w n√p log p • The equation W = K·T_o holds (for the dominant t_w term) if n² = K t_w n√p log p, i.e., n = K t_w √p log p • Because W = n², this gives W = K² t_w² p log² p • Isoefficiency function is Θ(p log² p) • Compared with the 1-D partitioning's isoefficiency of Θ(p²), the 2-D version is more scalable
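The corresponding derivation sketch for the 2-D case, again balancing W against the dominant t_w term of the overhead listed above.

```latex
% Sketch: isoefficiency of 2-D partitioned matrix-vector multiplication.
% Overhead: T_o = t_s p \log p + t_w n \sqrt{p} \log p.
\[
  n^2 = K t_w n \sqrt{p} \log p
  \;\Rightarrow\; n = K t_w \sqrt{p} \log p
  \;\Rightarrow\; W = n^2 = \Theta(p \log^2 p),
\]
% which grows more slowly with p than the \Theta(p^2) isoefficiency of the 1-D version.
```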

  9. Matrix-Matrix Multiplication • Compute: C = A·B, where A and B are n×n matrices • Serial complexity: W = O(n³)

  10. Simple 2-D Blocking Algorithm • Each process gets an (n/√p)×(n/√p) block of A, B, and C • Steps • Broadcast the blocks of A horizontally (along process rows) and the blocks of B vertically (along process columns) • Local computation
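A serial sketch of the simple 2-D blocked algorithm: after the row and column broadcasts, simulated process (i, j) has the whole block row i of A and block column j of B and accumulates C_ij locally. Sizes are illustrative.

```python
# Sketch: serial simulation of the simple 2-D blocked matrix-matrix algorithm.
import numpy as np

n, q = 8, 2                            # q = sqrt(p); illustrative sizes
A, B = np.random.rand(n, n), np.random.rand(n, n)
s = n // q                             # block dimension
blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s]

C = np.zeros((n, n))
for i in range(q):
    for j in range(q):
        for k in range(q):             # C_ij = sum_k A_ik * B_kj, all blocks local after the broadcasts
            C[i*s:(i+1)*s, j*s:(j+1)*s] += blk(A, i, k) @ blk(B, k, j)

assert np.allclose(C, A @ B)
```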

  11. Simple 2-D Blocking Algorithm • Two all-to-all broadcasts among groups of √p processes (one per row, one per column), total ≈ 2(t_s log √p + t_w n²/√p) • Local computation: n³/p • Total parallel time: T_P = n³/p + t_s log p + 2t_w n²/√p • Cost: n³ + t_s p log p + 2t_w n²√p • Isoefficiency: W = K·T_o • It holds if n³ = 2K t_w n²√p, i.e., n = 2K t_w √p • Because W = O(n³), the isoefficiency function is Θ(p^(3/2)) • Problem • Memory consumption: after the broadcasts, each process stores an entire block row of A and an entire block column of B, i.e., Θ(n²/√p) words instead of Θ(n²/p)

  12. Cannon's Algorithm • Total √p steps • Data shifts between two consecutive steps in both rows and columns: blocks of A move one position left along the rows, blocks of B one position up along the columns • Main benefit • No need to store entire block rows/columns; memory per process stays Θ(n²/p) • Parallel time: T_P ≈ n³/p + 2√p·t_s + 2t_w n²/√p • Isoefficiency: Θ(p^(3/2)), the same as the simple algorithm
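A serial sketch of Cannon's algorithm: after the initial skew, each of the √p steps multiplies the two resident blocks and then rotates the A-blocks left along rows and the B-blocks up along columns, so each simulated process only ever holds one block of A and one of B. Sizes are illustrative.

```python
# Sketch: serial simulation of Cannon's algorithm.
import numpy as np

n, q = 8, 4                        # q = sqrt(p); illustrative sizes
A, B = np.random.rand(n, n), np.random.rand(n, n)
s = n // q
Ab = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]
Bb = [[B[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(q)] for i in range(q)]

# initial alignment: shift row i of A-blocks left by i, column j of B-blocks up by j
Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

Cb = [[np.zeros((s, s)) for _ in range(q)] for _ in range(q)]
for _ in range(q):                 # sqrt(p) compute-and-shift steps
    for i in range(q):
        for j in range(q):
            Cb[i][j] += Ab[i][j] @ Bb[i][j]
    Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]  # shift A-blocks left
    Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]  # shift B-blocks up

assert np.allclose(np.block(Cb), A @ B)
```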

  13. DNS Algorithm • Distribute A over the i-k plane • After the broadcast along the j direction, each i-k slice has a full copy of A • Distribute B over the k-j plane • After the broadcast along the i direction, each k-j slice has a full copy of B • Local computation: one block product per process • Result reduction along the k direction • Result is in the i-j plane • Parallel time: Θ(log n) with p = n³ processes • Let p = q³ with q ≤ n, so each process holds an (n/q)×(n/q) block: T_P = n³/p + t_s log p + t_w (n²/p^(2/3)) log p • Isoefficiency: Θ(p log³ p)
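A serial sketch of the DNS data flow: each point (i, j, k) of a q × q × q grid forms one block product A_ik·B_kj (q = p^(1/3)), and the reduction along k leaves C in the i-j plane. Sizes are illustrative.

```python
# Sketch: serial simulation of the DNS algorithm's compute-and-reduce structure.
import numpy as np

n, q = 8, 2                            # q = p**(1/3); illustrative sizes
A, B = np.random.rand(n, n), np.random.rand(n, n)
s = n // q
blk = lambda M, i, j: M[i*s:(i+1)*s, j*s:(j+1)*s]

D = np.zeros((q, q, q, s, s))
for i in range(q):
    for j in range(q):
        for k in range(q):
            D[i, j, k] = blk(A, i, k) @ blk(B, k, j)   # one block product per process (i, j, k)

Cb = D.sum(axis=2)                                     # all-to-one reduction along the k direction
C = np.block([[Cb[i, j] for j in range(q)] for i in range(q)])
assert np.allclose(C, A @ B)
```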

  14. Gaussian Elimination • Solve Ax = b, where A is n×n and x and b are n×1 vectors • Transform it into Ux = y by Gaussian elimination • Three nested loops • Division: O(n²) operations • Multiply-add: O(n³) operations • Complexity: W = O(n³)
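A serial sketch of the three nested loops described above (no pivoting, so the example matrix is made diagonally dominant); it transforms Ax = b into Ux = y with U unit upper triangular.

```python
# Sketch: serial Gaussian elimination without pivoting.
import numpy as np

def gaussian_eliminate(A, b):
    """Transform Ax = b into Ux = y, with U unit upper triangular."""
    U, y = A.astype(float).copy(), b.astype(float).copy()
    n = U.shape[0]
    for k in range(n):                       # outer loop
        pivot = U[k, k]
        U[k, k:] /= pivot                    # division step: O(n^2) operations in total
        y[k] /= pivot
        for i in range(k + 1, n):            # update the trailing submatrix
            m = U[i, k]
            U[i, k:] -= m * U[k, k:]         # multiply-add step: O(n^3) operations in total
            y[i] -= m * y[k]
    return U, y

A = np.random.rand(4, 4) + 4 * np.eye(4)     # diagonally dominant, so no pivoting is needed
b = np.random.rand(4)
U, y = gaussian_eliminate(A, b)
assert np.allclose(A @ np.linalg.solve(U, y), b)
```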

  15. Gaussian Elimination • Outer loop k • Update row k by the division step • Update the trailing submatrix, which requires row k of A from the division step

  16. 1-D Partitioning (rowwise, p = n) • Outer loop k • Division step on row k • One-to-all broadcast of row k • Parallel updates of rows k+1 to n−1 • Analysis: total computation 3/2·n(n−1); total communication ≈ t_s n log n + 1/2·t_w n(n−1) log n; cost Θ(n³ log n), not cost optimal

  17. Pipelined GE • Previous algorithm • outer iteration k+1 starts only after iteration k finishes • Pipelined • Change the broadcast to nearest-neighbor shifts • Iteration k+1 starts as soon as the process owning row k+1 has received and forwarded the data of row k • Analysis • Outer loop: total n iterations • Each iteration: • O(n) local computation • O(n) communication • Total time Tp = O(n²) • Total cost O(n³) with n processes • Cost optimal

  18. 1-D Partitioning – Block Mapping

  19. 1-D Partitioning – Block vs. Cyclic • Block 1-D mapping: load imbalance (processes owning the top rows become idle as elimination proceeds) • Cyclic mapping: better balanced (see the sketch below)
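An illustrative sketch (with made-up sizes) of block vs. cyclic 1-D row mappings in Gaussian elimination: at outer iteration k only rows k+1 to n−1 are still active, so block mapping idles the low-numbered processes while cyclic mapping keeps the active rows spread almost evenly.

```python
# Sketch: counting active rows per process under block and cyclic 1-D mappings.
n, p, k = 16, 4, 8                                 # illustrative sizes; iteration k of GE
block_owner  = [i // (n // p) for i in range(n)]   # block mapping: contiguous chunks of rows
cyclic_owner = [i % p for i in range(n)]           # cyclic mapping: rows dealt out round-robin

def active_rows(owner):
    # rows k+1 .. n-1 are the only ones still being updated at iteration k
    return [sum(1 for i in range(k + 1, n) if owner[i] == q) for q in range(p)]

print("block :", active_rows(block_owner))   # [0, 0, 3, 4] -> imbalanced, two idle processes
print("cyclic:", active_rows(cyclic_owner))  # [1, 2, 2, 2] -> nearly balanced
```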

  20. 2-D Partitioning • Each process gets a 2-D block • Basic steps • Broadcast of the "active" column along the rows • Division step performed in parallel by the processes that own portions of the pivot row • Broadcast of the divided row along the columns • Rank-1 update of the trailing submatrix

  21. 2-D Partitioning, Pipelined • Analysis • Pipelining in both rows and columns • O(n) processing time • n² processes • O(n³) cost • Cost optimal

  22. 2-D Partitioning – Block Mapping

  23. 2-D Partitioning – Block vs. Cyclic • Cyclic mapping: better load balance

  24. Solving a Triangular System: Back-Substitution • Triangular system: Ux = y • U is an upper unit-triangular matrix • W = n²/2 multiplications and subtractions
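A serial sketch of back-substitution for a unit upper-triangular system, matching the roughly n²/2 multiply-subtract operations counted on this slide.

```python
# Sketch: serial back-substitution for Ux = y with U unit upper triangular.
import numpy as np

def back_substitute(U, y):
    n = len(y)
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):            # solve for x[k] from the bottom row up
        x[k] = y[k] - U[k, k+1:] @ x[k+1:]    # U[k, k] == 1, so no division is needed
    return x

U = np.triu(np.random.rand(5, 5), 1) + np.eye(5)   # a random unit upper-triangular matrix
y = np.random.rand(5)
assert np.allclose(U @ back_substitute(U, y), y)
```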

  25. Solving a Triangular System: Back-Substitution • 1-D rowwise partitioning onto p processes • Each process holds n/p rows of U and n/p elements of y • Using pipelining, total n steps (outer loop) • Each step: • unit-time communication to shift one x[k] • Θ(n/p) time to update a subsection of y (n/p elements) • Total parallel time: Θ(n²/p), cost optimal • 2-D partitioning onto a √p × √p process mesh • Using pipelining (in both rows and columns), total n steps • Each step still requires Θ(n/√p) time for updating y • Total parallel time: Θ(n²/√p), not cost optimal • But the entire procedure, Gaussian elimination + solving the triangular system, is dominated by the GE part, W = O(n³) • The whole procedure is cost optimal if n²√p = O(n³), i.e., p = O(n²)
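A short derivation sketch of the final cost-optimality condition above, assuming GE contributes the dominant W = Θ(n³) and back-substitution on a √p × √p mesh contributes the Θ(n²/√p) parallel time quoted on this slide.

```latex
% Sketch: cost-optimality of GE followed by 2-D pipelined back-substitution.
% Back-substitution cost on a sqrt(p) x sqrt(p) mesh:
\[
  p \cdot \Theta\!\left(\tfrac{n^2}{\sqrt{p}}\right) = \Theta\!\left(n^2 \sqrt{p}\right).
\]
% The combined procedure remains cost optimal with respect to W = \Theta(n^3) as long as
\[
  n^2 \sqrt{p} = O(n^3) \quad\Longleftrightarrow\quad p = O(n^2).
\]
```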
