Topic Overview
• Matrix-Matrix Multiplication
• Block Matrix Operations
• A Simple Parallel Matrix-Matrix Multiplication
• Cannon's Algorithm
• Overlapping Communication with Computation
Matrix-Matrix Multiplication
• Building on our matrix-vector multiplication (Quinn's Chapter 8), we now consider matrix-matrix multiplication:
  • multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
• For simplicity, we use the standard serial algorithm (sketched below).
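A minimal sketch of this serial algorithm in C (the row-major storage and loop ordering are assumptions, not taken from the slides):

#include <stddef.h>

/* Serial n x n matrix-matrix multiplication: C = A x B.
   Matrices are stored in row-major order as 1-D arrays of length n*n. */
void matmul_serial(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i*n + k] * B[(size_t)k*n + j];
            C[(size_t)i*n + j] = sum;
        }
}

The triple loop performs n³ multiplications and additions, which is the baseline the parallel algorithms below are compared against.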
Block Matrix Operations
• Matrix computations involving scalar algebraic operations on the matrix elements can be expressed in terms of identical operations on submatrices of the original matrix.
• Such algebraic operations on the submatrices are called block matrix operations.
  • They are useful in matrix multiplication as well as in a variety of other matrix algorithms.
• In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
• Block matrix multiplication then performs q³ submatrix multiplications, each involving (n/q) x (n/q) matrices and requiring (n/q)³ additions and multiplications (a sketch is given below).
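A minimal sketch of block matrix multiplication in C, assuming n is divisible by q, row-major storage, and a caller that zero-initializes C (names and layout are illustrative, not from the original slides):

#include <stddef.h>

/* Blocked n x n matrix multiplication: C += A x B, viewing each matrix as a
   q x q grid of (n/q) x (n/q) blocks. Assumes n % q == 0, row-major storage,
   and that C has been zero-initialized by the caller. */
void matmul_blocked(int n, int q, const double *A, const double *B, double *C)
{
    int b = n / q;   /* block size */
    for (int i = 0; i < q; i++)
        for (int j = 0; j < q; j++)
            for (int k = 0; k < q; k++)       /* block-level step: C[i][j] += A[i][k] x B[k][j] */
                for (int ii = i*b; ii < (i+1)*b; ii++)
                    for (int jj = j*b; jj < (j+1)*b; jj++) {
                        double sum = 0.0;
                        for (int kk = k*b; kk < (k+1)*b; kk++)
                            sum += A[(size_t)ii*n + kk] * B[(size_t)kk*n + jj];
                        C[(size_t)ii*n + jj] += sum;
                    }
}

The total work is unchanged (q³ block multiplications of (n/q)³ operations each, i.e., n³), but the block view is what the parallel algorithms below distribute across processes.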
Block Matrix Operations (continued)
• In block form, each block of the product is obtained as Ci,j = Σk Ai,k Bk,j for 0 ≤ i, j < q, i.e., the same formula as scalar matrix multiplication applied to submatrices.
A Simple Parallel Matrix-Matrix Multiplication Algorithm
• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast the blocks of A along rows and the blocks of B along columns.
• Perform the local submatrix multiplications (a sketch is given after this list).
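A minimal MPI sketch of this algorithm, assuming a periodic √p x √p Cartesian communicator, one contiguous b x b block of each matrix per process, and row-major storage (function and variable names are illustrative assumptions):

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Local multiply-accumulate for b x b row-major blocks: C += A x B. */
static void matmul_add(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int j = 0; j < b; j++)
            for (int k = 0; k < b; k++)
                C[(size_t)i*b + j] += A[(size_t)i*b + k] * B[(size_t)k*b + j];
}

/* cart is assumed to be a sqrt(p) x sqrt(p) Cartesian communicator;
   each process holds one b x b block of A, B, and C (myA, myB, myC). */
void simple_parallel_matmul(int b, int sqrt_p,
                            double *myA, double *myB, double *myC, MPI_Comm cart)
{
    MPI_Comm row_comm, col_comm;
    int keep_cols[2] = {0, 1};   /* keep dimension 1: the processes in my row    */
    int keep_rows[2] = {1, 0};   /* keep dimension 0: the processes in my column */

    /* Split the grid into row and column communicators. */
    MPI_Cart_sub(cart, keep_cols, &row_comm);
    MPI_Cart_sub(cart, keep_rows, &col_comm);

    /* All-to-all broadcast: gather every A block of my row and every B block of my column. */
    double *rowA = malloc((size_t)sqrt_p * b * b * sizeof(double));
    double *colB = malloc((size_t)sqrt_p * b * b * sizeof(double));
    MPI_Allgather(myA, b * b, MPI_DOUBLE, rowA, b * b, MPI_DOUBLE, row_comm);
    MPI_Allgather(myB, b * b, MPI_DOUBLE, colB, b * b, MPI_DOUBLE, col_comm);

    /* Local computation: C(i,j) = sum over k of A(i,k) x B(k,j). */
    memset(myC, 0, (size_t)b * b * sizeof(double));
    for (int k = 0; k < sqrt_p; k++)
        matmul_add(b, rowA + (size_t)k * b * b, colB + (size_t)k * b * b, myC);

    free(rowA);
    free(colB);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}

Note the memory cost made explicit by rowA and colB: each process ends up holding √p blocks of each input matrix, which is exactly the drawback analyzed below.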
Matrix-Matrix Multiplication: Performance Analysis
• The two all-to-all broadcasts take a total of 2(ts·log(√p) + tw·(n²/p)(√p − 1)) time, where ts is the message startup time and tw is the per-word transfer time.
• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices, i.e., √p·(n/√p)³ = n³/p time.
• The parallel run time is approximately Tp ≈ n³/p + ts·log p + 2·tw·n²/√p.
Drawback of the Simple Parallel Algorithm
• A major drawback of this algorithm is that it is not memory optimal.
• Each process has √p blocks of both matrices A and B at the end of each communication phase.
  • Thus, each process requires Θ(n²/√p) memory, since each block requires Θ(n²/p) memory.
• The total memory requirement over all the processes is therefore Θ(n²·√p), i.e., √p times the memory requirement of the sequential algorithm.
Matrix-Matrix Multiplication: Cannon's Algorithm
• Cannon's algorithm is a memory-efficient version of the simple parallel algorithm,
  • with a total memory requirement of Θ(n²).
• Matrices A and B are partitioned into p square blocks, as in the simple parallel algorithm.
• Although every process in the ith row requires all √p submatrices Ai,k (0 ≤ k < √p), the all-to-all broadcast can be avoided by:
  • scheduling the computations of the processes of the ith row such that, at any given time, each process is using a different block Ai,k;
  • systematically rotating these blocks among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.
• If an identical schedule is applied to the columns of B, then no process holds more than one block of each matrix at any time.
Communication Steps in Cannon's Algorithm
(Figure 8.3: the initial alignment and the sequence of single-step shifts of the A and B blocks; not reproduced here.)
Communication Steps in Cannon's Algorithm
• First, align the blocks of A and B in such a way that each process can multiply its local submatrices:
  • shift submatrices Ai,j to the left (with wraparound) by i steps;
  • shift submatrices Bi,j up (with wraparound) by j steps.
• After alignment (Figure 8.3c), process Pi,j has submatrices Ai,(j+i) mod √p and B(i+j) mod √p,j.
• Perform the local block matrix multiplication.
• Next, each block of A moves one step left and each block of B moves one step up (again with wraparound).
• Perform the next block multiplication, add it to the partial result, and repeat until all the blocks have been multiplied.
Cannon's Algorithm: An Example
• Consider two matrices A and B to be multiplied, each partitioned into 4 square blocks Ai,j and Bi,j (0 ≤ i, j < 2), one block per process of a 2 x 2 grid.
• In the initial alignment, row 1 of A is shifted left by one step and column 1 of B is shifted up by one step (both with wraparound); row 0 of A and column 0 of B stay in place.
Cannon's Algorithm: An Example
• After this alignment, process
  • P0,0 ends up with A0,0 and B0,0 and should compute C0,0;
  • P0,1 ends up with A0,1 and B1,1 and should compute C0,1;
  • P1,0 ends up with A1,1 and B1,0 and should compute C1,0;
  • P1,1 ends up with A1,0 and B0,1 and should compute C1,1.
• Each process then performs its local block matrix multiplication, producing the first partial product of its Ci,j.
Cannon's Algorithm: An Example
• Shift 1: shift each block of A one step to the left and each block of B one step up (with wraparound):
  • P0,0 now holds A0,1 and B1,0; P0,1 holds A0,0 and B0,1; P1,0 holds A1,0 and B0,0; P1,1 holds A1,1 and B1,1.
• Next, each process Pi,j performs another block multiplication and adds the result to Ci,j.
• After this second multiply-and-add step, every block Ci,j = Ai,0·B0,j + Ai,1·B1,j is complete.
Cannon's Algorithm: Performance Analysis
• In the alignment step, the maximum distance over which a block shifts is √p − 1; with wraparound, the two shift operations require a total of 2(ts + tw·n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes ts + tw·n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is √p·(n/√p)³ = n³/p.
• The parallel run time is approximately Tp ≈ n³/p + 2√p·ts + 2·tw·n²/√p.
MPI_Cart_shift Function
• Shifting data along the dimensions of the 2-D mesh is a frequent operation in Cannon's algorithm.
• MPI provides the function MPI_Cart_shift for this purpose:

int MPI_Cart_shift(
    MPI_Comm comm_cart,   /* communicator with Cartesian structure (handle)                      */
    int dir,              /* coordinate dimension along which the shift is performed             */
    int s_step,           /* shift size/displacement (> 0: upwards shift, < 0: downwards shift)  */
    int *rank_source,     /* rank of source process (output)                                     */
    int *rank_dest)       /* rank of destination process (output)                                */

• A small example program exercising this function is sketched below.
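A minimal sketch of MPI_Cart_shift on a periodic 2-D grid (the grid setup and variable names are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank, src, dst;
    int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm cart;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    /* Build a periodic 2-D Cartesian topology over all processes. */
    MPI_Dims_create(npes, 2, dims);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
    MPI_Comm_rank(cart, &myrank);
    MPI_Cart_coords(cart, myrank, 2, coords);

    /* One-step shift along dimension 1 (the column index): my data would go
       to 'dst' (the right neighbor) and arrive from 'src' (the left neighbor). */
    MPI_Cart_shift(cart, 1, 1, &src, &dst);

    printf("Process (%d,%d): source=%d, dest=%d\n", coords[0], coords[1], src, dst);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

Because the topology is periodic, the shift wraps around at the edges of the mesh, which is exactly the wraparound behavior Cannon's algorithm relies on.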
Sending and Receiving Messages Simultaneously
• To exchange messages in a single call, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

• The arguments are simply the combined arguments of the send and receive functions.
• If we wish to use the same buffer for both the send and the receive, we can use:

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)

• A parallel program for Cannon's algorithm built on these functions is sketched below.
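A minimal sketch of the core of Cannon's algorithm using MPI_Cart_shift and MPI_Sendrecv_replace (the communicator setup, block size b, and the local routine matmul_add from the earlier sketch are assumptions, not the original program):

#include <mpi.h>
#include <string.h>

/* Local multiply-accumulate for b x b blocks: C += A x B (as sketched earlier). */
void matmul_add(int b, const double *A, const double *B, double *C);

/* cart is assumed to be a periodic sqrt(p) x sqrt(p) Cartesian communicator;
   each process holds one b x b block of A (a), B (bmat), and C (c). */
void cannon(int b, int sqrt_p, double *a, double *bmat, double *c, MPI_Comm cart)
{
    int rank, coords[2], src, dst;
    MPI_Status status;

    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);
    memset(c, 0, (size_t)b * b * sizeof(double));

    /* Initial alignment: shift A left by i = coords[0] steps, B up by j = coords[1] steps. */
    MPI_Cart_shift(cart, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);
    MPI_Cart_shift(cart, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(bmat, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);

    /* Compute-and-shift phase: sqrt(p) block multiplications, each followed
       by a single-step left shift of A and up shift of B. */
    for (int i = 0; i < sqrt_p; i++) {
        matmul_add(b, a, bmat, c);
        MPI_Cart_shift(cart, 1, -1, &src, &dst);
        MPI_Sendrecv_replace(a, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);
        MPI_Cart_shift(cart, 0, -1, &src, &dst);
        MPI_Sendrecv_replace(bmat, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);
    }

    /* Optionally restore the original distribution of A and B. */
    MPI_Cart_shift(cart, 1, coords[0], &src, &dst);
    MPI_Sendrecv_replace(a, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);
    MPI_Cart_shift(cart, 0, coords[1], &src, &dst);
    MPI_Sendrecv_replace(bmat, b * b, MPI_DOUBLE, dst, 1, src, 1, cart, &status);
}

Each MPI_Sendrecv_replace blocks until its block has been exchanged, which is the behavior the non-blocking version below is designed to overlap with computation.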
Overlapping Communication with Computation
• Our MPI programs so far have used blocking send/receive operations to perform point-to-point communication.
• As discussed earlier:
  • a blocking send operation remains blocked until the message has been copied out of the send buffer;
  • a blocking receive operation returns only after the message has been received and copied into the receive buffer.
• In Cannon's algorithm, for example, each process blocks on MPI_Sendrecv_replace until the specified matrix block has been sent and received by the corresponding processes.
• Note that the blocks of matrices A and B do not change as they are shifted among the processes.
• Thus, we can overlap the transmission of these blocks with the computation for the matrix-matrix multiplication.
• Many recent distributed-memory parallel computers have dedicated communication controllers that can perform the transmission of messages without interrupting the CPUs.
Non-Blocking Communication Operations
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

• These functions return before the corresponding operations have completed.
• MPI_Test checks whether the non-blocking send or receive operation identified by its request has finished:

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

• MPI_Wait blocks until the operation completes (a small example is sketched below):

int MPI_Wait(MPI_Request *request, MPI_Status *status)
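A minimal sketch of overlapping a message exchange with computation using these functions (the neighbor ranks and the do_local_work routine are illustrative assumptions):

#include <mpi.h>

/* Hypothetical local computation that does not touch the communication buffers. */
void do_local_work(void);

void exchange_and_compute(double *sendbuf, double *recvbuf, int count,
                          int dest, int source, MPI_Comm comm)
{
    MPI_Request requests[2];

    /* Start the transfers, then compute while they are in flight. */
    MPI_Isend(sendbuf, count, MPI_DOUBLE, dest,   0, comm, &requests[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, source, 0, comm, &requests[1]);

    do_local_work();   /* overlapped computation */

    /* Wait for both operations before the buffers are reused. */
    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}

The key rule is that neither sendbuf nor recvbuf may be touched between starting the operations and completing them with MPI_Wait/MPI_Waitall.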
Cannon's Algorithm Using Non-Blocking Operations
• A parallel program for Cannon's algorithm using non-blocking operations can be structured as sketched after this list.
• There are two main differences between this program and the earlier one using blocking operations:
  • Additional arrays, a_buffers and b_buffers, hold the blocks of A and B that are being received while the computation involving the previous blocks is performed.
  • In the main computational loop, the program first starts the non-blocking send operations to send the locally stored blocks of A and B to the processes to the left and up the grid, and then starts the non-blocking receive operations to receive the blocks for the next iteration from the processes to the right and down the grid.
• After starting these four non-blocking operations, it proceeds to perform the matrix-matrix multiplication of the blocks it currently stores.
• Finally, before it proceeds to the next iteration, it uses MPI_Wait to wait for the send and receive operations to complete.
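A minimal sketch of that main loop, using double buffers and non-blocking operations (the communicator setup, block size b, neighbor ranks, and the matmul_add routine from the earlier sketch are assumptions, not the original program):

#include <mpi.h>

/* Local multiply-accumulate for b x b blocks: C += A x B (as sketched earlier). */
void matmul_add(int b, const double *A, const double *B, double *C);

/* a_buffers[2] and b_buffers[2] each hold one b x b block; index `cur` is the
   block being multiplied, 1-cur is the block being received for the next step.
   leftrank/rightrank/uprank/downrank are the mesh neighbors (e.g. from MPI_Cart_shift). */
void cannon_nonblocking_loop(int b, int sqrt_p,
                             double *a_buffers[2], double *b_buffers[2], double *c,
                             int leftrank, int rightrank, int uprank, int downrank,
                             MPI_Comm cart)
{
    MPI_Request reqs[4];
    int cur = 0;

    for (int i = 0; i < sqrt_p; i++) {
        /* Start shifting the current blocks of A left and B up ... */
        MPI_Isend(a_buffers[cur], b * b, MPI_DOUBLE, leftrank, 1, cart, &reqs[0]);
        MPI_Isend(b_buffers[cur], b * b, MPI_DOUBLE, uprank,   1, cart, &reqs[1]);
        /* ... and receiving the next blocks from the right and from below. */
        MPI_Irecv(a_buffers[1 - cur], b * b, MPI_DOUBLE, rightrank, 1, cart, &reqs[2]);
        MPI_Irecv(b_buffers[1 - cur], b * b, MPI_DOUBLE, downrank,  1, cart, &reqs[3]);

        /* Overlap: multiply the blocks currently held while the transfers proceed. */
        matmul_add(b, a_buffers[cur], b_buffers[cur], c);

        /* Make sure all four transfers are done before the buffers are reused. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        cur = 1 - cur;   /* swap buffers for the next iteration */
    }
}

If the local block multiplication takes longer than the block transfers, the communication cost of each iteration is effectively hidden behind the computation.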