Introduction to Parallel Computing. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Chapter 8: Dense Matrix Algorithms.
Outline • Matrix-Vector Multiplication • Matrix-Matrix Multiplication • Solving a System of Linear Equations • Gaussian Elimination • Solving a Triangular System: Back-Substitution
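Review • Performance Metrics • Work: $W$ (serial run time) • Parallel Time: $T_P$ • Total Overhead function: $T_o = pT_P - W$ • Cost: $pT_P$; Cost optimal: $pT_P = \Theta(W)$ • Speedup: $S = W/T_P$ • Efficiency: $E = S/p = W/(pT_P)$ • Analysis • Isoefficiency: $W = K\,T_o(W, p)$, where $K = E/(1-E)$ • Find how $W$ must grow with $p$ so that the equation holds • Communication – Hypercube network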
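The review metrics can be tied together with a few lines of arithmetic. The sketch below assumes the usual definitions (cost $pT_P$, overhead $pT_P - W$, speedup $W/T_P$, efficiency $S/p$); the function name `metrics` and the sample numbers are illustrative, not taken from the slides.

```python
def metrics(work, parallel_time, p):
    """Return the review metrics for a run on p processes (assumed definitions)."""
    cost = p * parallel_time          # process-time product p * T_P
    overhead = cost - work            # T_o = p * T_P - W
    speedup = work / parallel_time    # S = W / T_P
    efficiency = speedup / p          # E = S / p = W / (p * T_P)
    return {"cost": cost, "overhead": overhead,
            "speedup": speedup, "efficiency": efficiency}


# Hypothetical run: W = 1000 time units, T_P = 150 on p = 8 processes.
m = metrics(work=1000.0, parallel_time=150.0, p=8)
K = m["efficiency"] / (1.0 - m["efficiency"])   # isoefficiency constant E / (1 - E)
print(m, K)
```

With these numbers the isoefficiency relation can be checked directly: $K \cdot T_o$ reproduces $W$ up to rounding.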
Matrix-Vector Multiplication • Compute: $y = Ax$, where $A$ is an $n \times n$ matrix and $x$, $y$ are $n \times 1$ vectors • Serial complexity: $W = O(n^2)$
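Rowwise 1-D Partitioning • Each of the $p$ processes holds $n/p$ rows of $A$ and $n/p$ elements of $x$ • All-to-all broadcast of $x$: $t_s \log p + t_w (n/p)(p-1) \approx t_s \log p + t_w n$ • Local computation: $n^2/p$ • Parallel Time: $T_P = n^2/p + t_s \log p + t_w n$ • Cost: $pT_P = n^2 + t_s p \log p + t_w n p$ • Cost optimal if $p = O(n)$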
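Rowwise 1-D Partitioning • Scalability Analysis • $W = K\,T_o(W, p)$ with $K = E/(1-E)$, where $T_o = pT_P - W = t_s p \log p + t_w n p$ • The equation holds if $W = K t_s p \log p$ and $n^2 = K t_w n p$, i.e. $n = K t_w p$ • Because $W = O(n^2)$, the second condition gives $W = K^2 t_w^2 p^2$ • Isoefficiency function is $\Theta(p^2)$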
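The rowwise 1-D scheme can be simulated sequentially to make the data movement concrete. This is a minimal sketch, not an MPI program: the loop over `r` plays the role of the `p` processes, the all-to-all broadcast is modeled by concatenating the local pieces of `x`, and the name `matvec_rowwise_1d` is an assumption of this example.

```python
def matvec_rowwise_1d(A, x, p):
    """Simulate y = A x with rowwise 1-D partitioning over p processes."""
    n = len(A)
    assert n % p == 0, "sketch assumes p divides n"
    rows = n // p
    # Each process starts with n/p rows of A and the matching n/p entries of x.
    local_x = [x[r * rows:(r + 1) * rows] for r in range(p)]
    # All-to-all broadcast: afterwards every process holds the whole vector x.
    full_x = [v for piece in local_x for v in piece]
    y = []
    for r in range(p):                           # one iteration per process
        for i in range(r * rows, (r + 1) * rows):
            # Local computation: n/p dot products of length n.
            y.append(sum(A[i][j] * full_x[j] for j in range(n)))
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, -1, 2]
print(matvec_rowwise_1d(A, x, p=2))
```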
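2-D Partitioning • $p$ processes form a $\sqrt{p} \times \sqrt{p}$ mesh; each holds an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of $A$ • Initial Data Alignment: $t_s + t_w n/\sqrt{p}$ • One-to-all broadcast: $(t_s + t_w n/\sqrt{p}) \log \sqrt{p}$ • Local computation: $n^2/p$ • All-to-one reduction: $(t_s + t_w n/\sqrt{p}) \log \sqrt{p}$ • Parallel Time: $T_P \approx n^2/p + t_s \log p + t_w (n/\sqrt{p}) \log p$ • Cost: $pT_P \approx n^2 + t_s p \log p + t_w n \sqrt{p} \log p$ • Cost optimal if $p = O(n^2/\log^2 n)$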
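2-D Partitioning • Scalability Analysis • $T_o = pT_P - W = t_s p \log p + t_w n \sqrt{p} \log p$ • The equation $W = K\,T_o$ holds if $W = K t_s p \log p$ and $n^2 = K t_w n \sqrt{p} \log p$, i.e. $n = K t_w \sqrt{p} \log p$ • Because $W = n^2$, the second condition gives $W = K^2 t_w^2 p \log^2 p$ • Isoefficiency function is $\Theta(p \log^2 p)$ • Compared with the 1-D partitioning's isoefficiency of $\Theta(p^2)$, the 2-D partitioning is more scalable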
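The 2-D scheme can be sketched the same way. The example below is a sequential simulation under stated assumptions: `p` is a perfect square, $\sqrt{p}$ divides `n`, and the alignment plus column-wise broadcast are modeled by simply giving every process in mesh column `c` block `c` of `x`. The function name `matvec_2d` is hypothetical.

```python
import math


def matvec_2d(A, x, p):
    """Simulate y = A x on a sqrt(p) x sqrt(p) mesh of processes."""
    n = len(A)
    q = math.isqrt(p)
    assert q * q == p and n % q == 0, "sketch assumes a square mesh that divides n"
    b = n // q
    # After alignment and the one-to-all broadcast, every process in mesh
    # column c holds block c of x.
    x_block = [x[c * b:(c + 1) * b] for c in range(q)]
    y = [0] * n
    for r in range(q):
        # partial[c] is the length-b partial result computed by process (r, c).
        partial = [
            [sum(A[r * b + i][c * b + j] * x_block[c][j] for j in range(b))
             for i in range(b)]
            for c in range(q)
        ]
        # All-to-one reduction along mesh row r sums the partial vectors.
        for i in range(b):
            y[r * b + i] = sum(part[i] for part in partial)
    return y


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, -1, 2]
print(matvec_2d(A, x, p=4))
```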
Matrix-Matrix Multiplication • Compute: $C = AB$, where $A$, $B$, and $C$ are $n \times n$ matrices • Serial complexity: $W = O(n^3)$
Simple 2D Blocking Algorithm • Each processor gets one $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of each of A, B, and C • Steps • Broadcast the A blocks horizontally (along mesh rows) and the B blocks vertically (along mesh columns) • Local computation
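Simple 2D Blocking Algorithm • Two All-to-All Broadcasts (one within each row, one within each column), total $2\left(t_s \log \sqrt{p} + t_w (n^2/p)(\sqrt{p} - 1)\right) \approx t_s \log p + 2 t_w n^2/\sqrt{p}$ • Local computation: $\sqrt{p}$ block multiplications of $(n/\sqrt{p})^3$ each, $n^3/p$ in total • Total parallel time: $T_P \approx n^3/p + t_s \log p + 2 t_w n^2/\sqrt{p}$ • Cost: $pT_P \approx n^3 + t_s p \log p + 2 t_w n^2 \sqrt{p}$ • Isoefficiency: $W = K\,T_o$ with $T_o = t_s p \log p + 2 t_w n^2 \sqrt{p}$ • It holds if $W = K t_s p \log p$ and $n^3 = 2K t_w n^2 \sqrt{p}$, i.e. $n = 2K t_w \sqrt{p}$ • Because $W = O(n^3)$, the Isoefficiency function is $\Theta(p^{3/2})$ • Problem • Memory consumption: after the broadcasts, each process stores an entire row of A blocks and an entire column of B blocks, $\Theta(n^2/\sqrt{p})$ elements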
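A sequential simulation of the simple 2D blocking algorithm makes the memory problem visible: each process ends up holding $\sqrt{p}$ blocks of A and $\sqrt{p}$ blocks of B before it computes anything. This is a sketch under the assumption that `p` is a perfect square and $\sqrt{p}$ divides `n`; the helper names are illustrative.

```python
import math


def simple_2d_matmul(A, B, p):
    """Simulate C = A B with the simple 2-D block algorithm on p processes."""
    n, q = len(A), math.isqrt(p)
    assert q * q == p and n % q == 0, "sketch assumes a square mesh that divides n"
    b = n // q

    def block(M, r, c):
        # Block (r, c) of M, of size b x b.
        return [row[c * b:(c + 1) * b] for row in M[r * b:(r + 1) * b]]

    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # After the two all-to-all broadcasts, process (i, j) stores the
            # whole block row i of A and the whole block column j of B.
            row_blocks = [block(A, i, k) for k in range(q)]
            col_blocks = [block(B, k, j) for k in range(q)]
            # Local computation: C_ij = sum over k of A_ik * B_kj.
            for k in range(q):
                Ab, Bb = row_blocks[k], col_blocks[k]
                for r in range(b):
                    for s in range(b):
                        C[i * b + r][j * b + s] += sum(
                            Ab[r][t] * Bb[t][s] for t in range(b))
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(simple_2d_matmul(A, B, p=4))
```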
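Cannon Algorithm • Total $\sqrt{p}$ steps • Between steps, the A blocks shift one position along their rows and the B blocks shift one position along their columns • Main benefit • No need to store entire rows/columns of blocks; each process holds one block of A and one of B at a time • Parallel time: $T_P \approx n^3/p + 2\sqrt{p}\, t_s + 2 t_w n^2/\sqrt{p}$ • Isoefficiency: $\Theta(p^{3/2})$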
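The shifting pattern of Cannon's algorithm is easier to follow in code. The sketch below simulates it sequentially; it assumes the usual initial alignment (block row `i` of A shifted left by `i`, block column `j` of B shifted up by `j`), a perfect-square `p`, and $\sqrt{p}$ dividing `n`. The name `cannon_matmul` is an assumption of this example.

```python
import math


def cannon_matmul(A, B, p):
    """Simulate Cannon's algorithm for C = A B on a sqrt(p) x sqrt(p) mesh."""
    n, q = len(A), math.isqrt(p)
    assert q * q == p and n % q == 0, "sketch assumes a square mesh that divides n"
    b = n // q

    def block(M, r, c):
        return [row[c * b:(c + 1) * b] for row in M[r * b:(r + 1) * b]]

    # Initial alignment: process (i, j) starts with A block (i, i+j) and
    # B block (i+j, j), indices taken modulo q.
    a = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    bl = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):                        # sqrt(p) compute-and-shift steps
        for i in range(q):
            for j in range(q):
                for r in range(b):
                    for s in range(b):
                        c[i][j][r][s] += sum(
                            a[i][j][r][t] * bl[i][j][t][s] for t in range(b))
        # Shift A blocks one step left along rows, B blocks one step up along columns.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bl = [[bl[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Assemble the distributed C blocks into one matrix.
    return [[c[i // b][j // b][i % b][j % b] for j in range(n)] for i in range(n)]


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
print(cannon_matmul(A, B, p=4))
```

Each simulated process holds exactly one block of A and one block of B at any time, which is the memory advantage over the simple 2D blocking algorithm.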
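DNS Algorithm • Processes form a 3-D array indexed by $(i, j, k)$ • Distribute A • Each i-k slice has a full A • Distribute B • Each k-j slice has a full B • Local computation: process $(i, j, k)$ multiplies $A[i,k]$ by $B[k,j]$ • Result reduction along k direction • Result is in the i-j plane • Parallel Time: $T_P \approx n^3/p + t_s \log p + t_w (n^2/p^{2/3}) \log p$ • Let $p = q^3$: each process holds $(n/q) \times (n/q)$ blocks • Isoefficiency: $\Theta(p \log^3 p)$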
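The DNS data layout can be illustrated with the simplest case of one matrix element per process, i.e. $n^3$ processes. This is only a sequential sketch of the index pattern; the function name `dns_matmul` is hypothetical and the one-element-per-process setting is an assumption chosen for brevity.

```python
def dns_matmul(A, B):
    """Simulate the DNS index pattern with one element per process (p = n^3)."""
    n = len(A)
    # After distribution, process (i, j, k) holds A[i][k] and B[k][j]
    # and performs a single multiplication.
    prod = [[[A[i][k] * B[k][j] for k in range(n)]
             for j in range(n)]
            for i in range(n)]
    # Reduction along the k direction leaves C[i][j] in the i-j plane.
    return [[sum(prod[i][j]) for j in range(n)] for i in range(n)]


print(dns_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```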
Gaussian Elimination • Solve $Ax = b$, where $A$ is $n \times n$ and $x$ and $b$ are $n \times 1$ vectors • Transform it into $Ux = y$ by Gaussian Elimination, where $U$ is unit upper-triangular • Three nested loops • Division: $O(n^2)$ • Multiply-add: $O(n^3)$ • Complexity: $W = O(n^3)$
Gaussian Elimination • Outer loop k • Update row k by division • Update a submatrix • Requires row A[k, ] produced by the division step
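The three nested loops can be written out as a short serial routine. This is a minimal sketch without pivoting, so it assumes every pivot `U[k][k]` is nonzero; the function name `gaussian_eliminate` is illustrative.

```python
def gaussian_eliminate(A, b):
    """Reduce A x = b to U x = y with U unit upper-triangular (no pivoting)."""
    n = len(A)
    U = [[float(v) for v in row] for row in A]   # work on copies
    y = [float(v) for v in b]
    for k in range(n):                           # outer loop
        pivot = U[k][k]                          # assumed nonzero in this sketch
        for j in range(k + 1, n):                # division step on row k
            U[k][j] /= pivot
        y[k] /= pivot
        U[k][k] = 1.0
        for i in range(k + 1, n):                # update the trailing submatrix
            factor = U[i][k]
            for j in range(k + 1, n):
                U[i][j] -= factor * U[k][j]      # needs row k from the division step
            y[i] -= factor * y[k]
            U[i][k] = 0.0
    return U, y


U, y = gaussian_eliminate([[2, 1], [4, 5]], [3, 6])
print(U, y)
```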
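1-D Rowwise Partitioning • Outer loop k • Division • One-to-all broadcast of row k • Parallel updates of rows [k+1, n-1] • Analysis (one row per process, $p = n$) • Communication: $(t_s + t_w (n-k-1)) \log n$ per broadcast, $\Theta(n^2 \log n)$ over all iterations • Computation: $3(n-k-1)$ operations in iteration $k$, about $3n(n-1)/2$ in total • Cost $\Theta(n^3 \log n)$: not cost optimal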
Pipelined GE • Previous algorithm • Outer loop k+1 starts only after iteration k finishes • Pipelined • Change the broadcast to a shift: each process passes row k to the next process • Iteration k+1 starts as soon as row k+1 has received row k, passed it down, and been updated • Analysis • Outer loop: n iterations in total • Each iteration: • O(n) local computation • O(n) communication • Total time $T_P = O(n^2)$ • Total cost $O(n^3)$ with n processes • Cost optimal
1D Partitioning – Block vs. Cyclic • Block 1-D mapping: load imbalance (processes holding the top rows go idle once their rows are finished) • Cyclic mapping: better balanced
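The imbalance can be quantified by counting how much elimination work lands on each process under the two mappings. The sketch below assumes that updating one row in iteration k costs $n-k-1$ units and ignores communication; `work_per_process` and the two owner functions are names made up for this example.

```python
def work_per_process(n, p, owner):
    """Total row-update work per process when row i belongs to owner(i)."""
    load = [0] * p
    for k in range(n):                 # outer loop of Gaussian elimination
        for i in range(k + 1, n):      # rows still being updated in iteration k
            load[owner(i)] += n - k - 1
    return load


n, p = 8, 4
block_load = work_per_process(n, p, lambda i: i // (n // p))   # block mapping
cyclic_load = work_per_process(n, p, lambda i: i % p)          # cyclic mapping
print(block_load)    # [7, 31, 47, 55]
print(cyclic_load)   # [22, 32, 40, 46]
```

Both mappings perform the same total work, but the busiest process carries 55 units under the block mapping against 46 under the cyclic one in this small case.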
2D Partitioning • Each processor gets a 2D block • Basic Steps • Broadcast of the “active” column along the rows • Divide step in parallel by the processors that own portions of the active row • Broadcast of the active row along the columns • Rank-1 update
2D Partition Pipelined • Analysis • Pipeline in both rows and columns • $O(n)$ processing time • $n^2$ processors • $O(n^3)$ cost • Cost optimal
2D Partition – Block vs. Cyclic • Cyclic mapping: better load balance
Solving a Triangular System: Back-Substitution • Triangular System: $Ux = y$ • $U$ is a unit upper-triangular matrix • $W = n^2/2$ multiplications and subtractions
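Serial back-substitution for a unit upper-triangular system takes only a few lines. This is a sketch that relies on the unit diagonal (so no division is needed); the function name `back_substitute` is illustrative.

```python
def back_substitute(U, y):
    """Solve U x = y where U is unit upper-triangular."""
    n = len(U)
    y = list(y)                        # work on a copy
    x = [0] * n
    for k in range(n - 1, -1, -1):     # last unknown first
        x[k] = y[k]                    # diagonal entry is 1, so no division
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i][k]     # remove x[k]'s contribution from the rows above
    return x


U = [[1, 2, 3], [0, 1, 4], [0, 0, 1]]
print(back_substitute(U, [14, 14, 3]))   # [1, 2, 3]
```

The inner loop performs roughly $n^2/2$ multiply-subtract pairs in total, matching the work stated on the slide.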
Solving a Triangular System: Back-Substitution • 1-D rowwise partitioning to p processes • Each process holds n/p rows of U and n/p elements of y • Using pipelining, total n steps (outer loop) • Each step: • Unit time communication, shift one x[k] • $O(n/p)$ computation: update a subsection of y, n/p elements • Total parallel time: $\Theta(n^2/p)$, cost optimal • 2-D partitioning to a $\sqrt{p} \times \sqrt{p}$ process mesh • Using pipelining (both row and column), total n steps • Each step: still requires $O(n/\sqrt{p})$ for updating y • Total parallel time: $\Theta(n^2/\sqrt{p})$, not cost optimal • But the entire process (Gaussian Elimination + solving the triangular system) is dominated by the GE part, $W = O(n^3)$ • The whole process is cost optimal if $n^2\sqrt{p} = O(n^3)$, or $p = O(n^2)$