Sparse Matrix Dense Vector Multiplication by Pedro A. Escallon Parallel Processing Class Florida Institute of Technology April 2002
The Problem • Improve the speed of sparse matrix - dense vector multiplication using MPI on a Beowulf parallel computer.
What To Improve • Current algorithms use excessive indirect addressing • Current optimizations depend on the structure of the matrix (distribution of the nonzero elements)
Sparse Matrix Representations • Coordinate format • Compressed Sparse Row (CSR) • Compressed Sparse Column (CSC) • Modified Sparse Row (MSR)
Compressed Sparse Row (CSR) • rS: offset at which each row starts • ndx: column index of each nonzero • val: the nonzero values
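A small illustrative example (not from the slides) of the three CSR arrays in C:

    /* CSR storage of the 3x3 matrix
           | 1 0 2 |
       A = | 0 3 0 |
           | 4 5 6 |
    */
    double val[] = { 1, 2, 3, 4, 5, 6 };  /* nonzero values, row by row  */
    int    ndx[] = { 0, 2, 1, 0, 1, 2 };  /* column index of each value  */
    int    rS[]  = { 0, 2, 3, 6 };        /* where each row starts       */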
CSR Code

    /* y = A*x for an m-row CSR matrix (val, ndx, rS); y starts at zero */
    void sparseMul(int m, double *val, int *ndx, int *rS,
                   double *x, double *y)
    {
        int i, j;
        for (i = 0; i < m; i++) {                  /* each row i          */
            for (j = rS[i]; j < rS[i + 1]; j++) {  /* each nonzero in row */
                y[i] += (*val++) * x[*ndx++];      /* indirect access     */
            }
        }
    }
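A minimal driver (illustrative, not from the slides) applying sparseMul to the example matrix above with x = (1, 1, 1):

    double x[3] = { 1, 1, 1 };
    double y[3] = { 0, 0, 0 };
    sparseMul(3, val, ndx, rS, x, y);  /* y becomes { 3, 3, 15 } */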
Goals • Eliminate indirect addressing • Remove the dependency on the distribution of the nonzero elements • Further compress the matrix storage • Most of all, to speed up the operation
Data Structure

    typedef struct {
        int    rCol;  /* column index, or a non-positive new-row marker */
        double val;   /* nonzero value                                  */
    } dSparS_t;

The matrix is stored as a stream of {rCol, val} pairs.
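Because arrays of this struct are later scattered and gathered, MPI needs a derived datatype describing it. The helper below is a sketch of one standard way to build it; the name make_dSparS_type is an assumption, not from the slides.

    #include <mpi.h>
    #include <stddef.h>

    /* Sketch (assumed): build an MPI datatype matching dSparS_t so
       arrays of elements can be scattered/gathered directly. */
    MPI_Datatype make_dSparS_type(void)
    {
        MPI_Datatype t;
        int          blocklens[2] = { 1, 1 };
        MPI_Aint     displs[2]    = { offsetof(dSparS_t, rCol),
                                      offsetof(dSparS_t, val) };
        MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

        MPI_Type_create_struct(2, blocklens, displs, types, &t);
        MPI_Type_commit(&t);
        return t;
    }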
Partitioning A • Each of the p processes receives local_size = hdr.size / p elements • The residual = hdr.size % p leftover elements (residual < p) are handled separately
Scatter • A is scattered from the root so that each process receives a local_A chunk of local_size elements
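A sketch (not from the slides) of the distribution step, reusing the elem_type datatype built above and assuming the root process holds A; how the residual elements travel is left out here:

    #include <stdlib.h>   /* malloc */

    int local_size = hdr.size / p;   /* per-process share              */
    int residual   = hdr.size % p;   /* leftover, < p, assumed handled */
                                     /* separately                     */
    dSparS_t *local_A = malloc(local_size * sizeof *local_A);

    /* every process receives its contiguous chunk of A */
    MPI_Scatter(A,       local_size, elem_type,
                local_A, local_size, elem_type,
                0, MPI_COMM_WORLD);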
Multiplication Code

    /* multiply the local chunk: a positive rCol is a column index into X;
       a non-positive rCol marks the first element of a new row           */
    if ((index = local_A[0].rCol) > 0)
        local_Y[0].val = local_A[0].val * X[index];
    else
        local_Y[0].val = local_A[0].val * X[0];
    local_Y[0].rCol = -1;
    k = 1;
    h = 0;
    while (k < local_size) {
        /* accumulate the remaining elements of the current row */
        while (k < local_size && 0 < (index = local_A[k].rCol))
            local_Y[h].val += local_A[k++].val * X[index];
        if (k < local_size) {
            /* row boundary: record the finished row's number (-rCol - 1),
               then start accumulating the next row                       */
            local_Y[h++].rCol = -index - 1;
            local_Y[h].val = local_A[k++].val * X[0];
        }
    }
    /* close the last row, then pad local_Y out to stride elements */
    local_Y[h].rCol = local_Y[h - 1].rCol + 1;
    h++;
    while (h < stride)
        local_Y[h++].rCol = -1;
Multiplication • [Diagram: local_Y (stride elements, the range) = local_A (local_size elements) * X (the domain)]
Algorithm • [Diagram: how the algorithm steps through local_A and X to produce Y.val and Y.rCol]
Gather • [Diagram: each process contributes stride elements of local_Y to gatherBuffer; rows split across chunk boundaries arrive as split elements, plus the residual range]
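The matching collection step, again as a sketch under the same assumptions as the scatter above:

    /* each process contributes its stride-element local_Y; the root
       receives them concatenated in rank order                      */
    MPI_Gather(local_Y,      stride, elem_type,
               gatherBuffer, stride, elem_type,
               0, MPI_COMM_WORLD);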
Consolidation of Split Rows • [Diagram: the gathered partial sums in gatherBuffer are added (+=) into the nCols-element result Y, together with the residual]
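The slides do not show the consolidation code; below is a minimal sketch of one way it could work, assuming (as in the multiplication code above) that finished rows carry their row index in rCol and padding carries -1, and that the final result Y is a zero-initialized dense double array:

    /* partial sums for the same row share an rCol value, so adding by
       row index merges rows that were split across two processes     */
    for (i = 0; i < p * stride; i++) {
        int r = gatherBuffer[i].rCol;
        if (r >= 0)                  /* skip the rCol == -1 padding   */
            Y[r] += gatherBuffer[i].val;
    }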
Results (vavasis3) • vavasis3.rua - Total non-zero values: 1,683,902 • [Charts: execution times for p = 1, 2, 4, 8, and 10, and the calculated results]
Results (bayer02) • bayer02.rua - Total non-zero values: 63,679 • [Charts: execution times for p = 1, 2, 4, 8, and 10, and the calculated results]
Conclusions • The proposed representation speeds up the matrix-vector calculation • The handling of mismatched (split) rows before the gather step should be improved • There appears to be a communication penalty for moving structured data
Bibliography • Eun-Jin Im, “Optimizing the Performance of Sparse Matrix-Vector Multiplication” (dissertation) • Yousef Saad, “Iterative Methods for Sparse Linear Systems” • Iain S. Duff, “Users’ Guide for the Harwell-Boeing Sparse Matrix Collection”