CEES 157 2006
• 18 registered students
• Efficient serial programming
• Parallel programming
• Seminars by Geoscience HPC users
• Projects on optimizing/parallelizing code
Parallel programming
• Introduction to OpenMP
  • Loops, parallel environment
• Introduction to MPI
  • Parallel environment, passing data, comparisons to OpenMP (a minimal sketch contrasting the two follows)
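The two models the course covers differ mainly in how data is shared: OpenMP threads see one address space, while MPI ranks must exchange data explicitly. A minimal sketch contrasting them on an array sum; this is not from the course materials, and the function names and the sum example are illustrative only:

  /* OpenMP: shared-memory loop parallelism */
  #include <omp.h>
  double sum_omp(const double *x, int n) {
      double s = 0.0;
      /* threads sum disjoint chunks; the reduction combines partial sums */
      #pragma omp parallel for reduction(+:s)
      for (int i = 0; i < n; i++) s += x[i];
      return s;
  }

  /* MPI: distributed memory, data is passed explicitly */
  #include <mpi.h>
  double sum_mpi(const double *xlocal, int nlocal) {
      double s = 0.0, total = 0.0;
      for (int i = 0; i < nlocal; i++) s += xlocal[i];  /* each rank sums its own data */
      MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return total;
  }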
Projects
• 8 projects
• 5-7 were successes
• Projects were concentrated in imaging and fluid flow
• The main benefits were writing more efficient serial code and introducing OpenMP into existing code
Loop invariant

Old code:
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
      wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
    }
  }

New code:
  signdzw=-signn*dz*w;
  signdzwv0z=-signn*dz*w_v0_z;
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=signdzw/v_z[iy][ix]-signdzwv0z;
      wld_z[iy][ix]=wld_z[iy][ix]*cmplx(cos(phsft),sin(phsft));
    }
  }
Loop invariant

Old code:
  for(ix=1;ix<nx-1;ix++){
    url=u[ix-1];
    urr=u[ix+1];
    ror =cmplx(tr+b*cb[ix]*cb[ix],  a*ca[ix]);
    rorr=cmplx(tr+b*cb[ix]*cb[ix+1],a*ca[ix]);
    rorl=cmplx(tr+b*cb[ix]*cb[ix-1],a*ca[ix]);
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    rol=conjg(ror);
    ra[ix]=1.0-2.0*rol;
    rb[ix]  =cmplx(tr+b*cb[ix]*cb[ix+1],-a*ca[ix]);
    rc[ix-1]=cmplx(tr+b*cb[ix]*cb[ix-1],-a*ca[ix]);
  }
Loop invariant

New code:
  for(ix=1;ix<nx-1;ix++){
    tmpbcb=b*cb[ix];   /* computed once per iteration instead of three times */
    tmpaca=a*ca[ix];   /* computed once instead of four times */
    url=u[ix-1];
    urr=u[ix+1];
    ror =cmplx(tr+tmpbcb*cb[ix],  tmpaca);
    rorr=cmplx(tr+tmpbcb*cb[ix+1],tmpaca);
    rorl=cmplx(tr+tmpbcb*cb[ix-1],tmpaca);
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    rol=conjg(ror);
    ra[ix]=1.0-2.0*rol;
    rb[ix]  =conjg(rorr);
    rc[ix-1]=conjg(rorl);
  }
Functional Inlining

  for(ix=1;ix<nx-1;ix++){
    tmpbcb=b*cb[ix];
    tmpaca=a*ca[ix];
    url=u[ix-1];
    urr=u[ix+1];
    __real__ ror =tr+tmpbcb*cb[ix];   __imag__ ror =tmpaca;
    __real__ rorr=tr+tmpbcb*cb[ix+1]; __imag__ rorr=tmpaca;
    __real__ rorl=tr+tmpbcb*cb[ix-1]; __imag__ rorl=tmpaca;
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    __real__ rol= __real__ ror;  __imag__ rol=-__imag__ ror;
    ra[ix]=1.0-2.0*rol;
    __real__ rb[ix]  = __real__ rorr; __imag__ rb[ix]  =-__imag__ rorr;
    __real__ rc[ix-1]= __real__ rorl; __imag__ rc[ix-1]=-__imag__ rorl;
  }
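__real__ and __imag__ are GNU C extensions that read or write the parts of a complex number in place, which is what lets the loop above avoid the cmplx() and conjg() calls entirely. A minimal self-contained demonstration, not from the original code:

  #include <complex.h>
  #include <stdio.h>

  int main(void) {
      double complex z;
      __real__ z = 3.0;    /* GNU extension: assign the real part directly */
      __imag__ z = -4.0;   /* and the imaginary part, no constructor call */
      printf("z = %f + %fi\n", creal(z), cimag(z));
      return 0;
  }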
Vectorizing

Old code:
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
      wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
    }
  }
Vectorizing

New code:
  signdzw=-signn*dz*w;
  signdzwv0z=-signn*dz*w_v0_z;
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phin[ix]=signdzw/v_z[iy][ix]-signdzwv0z;
    }
    vsCos(nx,phin,cosout);
    vsSin(nx,phin,sinout);
    for(ix=0;ix<nx;ix++){
      __real__ tmp_wld[iy][ix]=cosout[ix];
      __imag__ tmp_wld[iy][ix]=sinout[ix];
    }
    for(ix=0;ix<nx;ix++) wld_z[iy][ix]*=tmp_wld[iy][ix];
  }

vsCos() and vsSin() are Intel MKL vector math routines that evaluate cos and sin over a whole array in one call, which vectorizes far better than scalar math calls inside the loop.
Computation time
• Old code: 654 seconds
• New code: 220 seconds (~3x speedup)
Implementation choices
• Restrictions
  • Cartesian grids
  • Diagonal permeability tensors
  • Low-order accurate MFEM (mixed finite element method)
• Large memory need => OpenMP
  • Reservoir simulation models > 10^6 elements
• On Cartesian grids the MFEM solves for 3*Nx*Ny + Nx + Ny unknowns (worked check below)
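As a worked check of that count (one consistent reading: Nx*Ny cell unknowns plus the (Nx+1)*Ny and Nx*(Ny+1) edge unknowns of a lowest-order mixed method), the 70x70 grid used on the Validation slide below gives 3*70*70 + 70 + 70 = 14,840 unknowns.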
Parallelized code
• Initializations
  • Several large matrices initialized (for grid, system, …)

  #pragma omp parallel for private(i,j)
  for(i=0; i < (3*nx*ny+nx+ny); i++) {
    for(j=0; j < (3*nx*ny+nx+ny); j++) {
      tangent[i][j] = 0.;
    }
  }
Parallelized code
• Grid generation
  • Grid data structures created in loops

  #pragma omp parallel for private(i,j)
  for(j=0; j<ny; j++)      // Loop in y direction
  {
    for(i=0; i<nx; i++)    // Loop in x direction
    {
      blah... blah...
    }
  }
Parallelized code
• Loop over elements
  • Builds tangent, residual and solution vector
  • Many calls to shape function routines

  #pragma omp parallel for private(et,i_e,i_e_2)
  for (et=0; et<n_dof_i; et++)        // Loop over elements
  {
    for(i_e=0; i_e<4; i_e++)          // Double loop over edges
    {
      for(i_e_2=0; i_e_2<4; i_e_2++)
      {
        blah… blah…
      }
    }
  }
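One hazard the elided loop body must handle: neighboring elements share edges, so parallel accumulation into a shared tangent or residual can race. A minimal sketch of one standard fix, an atomic update; the arrays edge_dof and contrib are hypothetical stand-ins, not the course's data structures:

  #include <omp.h>

  /* Hypothetical sketch, not the course code: race-free assembly of a
     shared residual from a parallel element loop. */
  void assemble(double *residual, const int (*edge_dof)[4],
                const double (*contrib)[4], int n_elements) {
      int et, i_e;
      #pragma omp parallel for private(i_e)
      for (et = 0; et < n_elements; et++) {
          for (i_e = 0; i_e < 4; i_e++) {
              /* two elements can touch the same edge, so guard the update */
              #pragma omp atomic
              residual[edge_dof[et][i_e]] += contrib[et][i_e];
          }
      }
  }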
Results
  [figure not preserved]
Validation
• 70x70 grid
• 14840 unknowns
• Tangent: 220,225,600 terms (= 14840^2)
Performance
  [figure not preserved]
Old Pseudo-code

  For all Shots (~thousands)    !! Parallelizable
    For all Frequencies (~50-100)
      For all Depths (~500)
        For all References (~4-10)
          U=U*SSF
          FFT (U)
          U=U*PHS
          IFFT(U)
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots
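Since the outer loop is flagged parallelizable (every shot is independent), the natural OpenMP form puts the parallel directive at the shot level. A minimal sketch, assuming a hypothetical migrate_shot() that wraps the frequency/depth/reference loops:

  #include <omp.h>

  void migrate_shot(int ishot);   /* hypothetical: runs the inner loops for one shot */

  void migrate_all(int nshots) {
      int ishot;
      /* shots are independent, so the outermost loop parallelizes cleanly;
         dynamic scheduling balances shots of uneven cost */
      #pragma omp parallel for schedule(dynamic)
      for (ishot = 0; ishot < nshots; ishot++)
          migrate_shot(ishot);
  }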
Unnecessary complex math

  !! --------------------------------------------------------------
  !! SSF correction subroutine
  subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
    complex :: dax(:,:),ikz2d(:,:)
    real    :: w,dir,allfields(:,:,:),allrefs(:,:)
    integer :: iz,ir
    ikz2d = (cmplx(0.,1.)*allrefs(ir,3)*( allfields(:,:,3)-allrefs(ir,3) )) + &
            ( allrefs(ir,4)*w**2       *( allfields(:,:,4)-allrefs(ir,4) )) / &
            sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
    dax = dax * cexp( cmplx( -dz*abs(aimag(ikz2d)), -dz*real(ikz2d) ) )
  end subroutine wemig_ssf
  !! --------------------------------------------------------------
New Routine - 2.8x faster

  !! --------------------------------------------------------------
  !! SSF correction subroutine
  subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
    complex :: dax(:,:)
    real    :: w,dir,allfields(:,:,:),allrefs(:,:),ikz2d(:,:)
    integer :: iz,ir
    ikz2d = -dz*( allrefs(ir,4)*w**2*( allfields(:,:,4)-allrefs(ir,4) )) / &
            sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
    dax = dax * cmplx( cos(ikz2d), sin(ikz2d) )
  end subroutine wemig_ssf
  !! --------------------------------------------------------------
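The core of the 2.8x gain is that the phase is now a purely real array, so the complex exponential can be built from one cos and one sin instead of a general cexp of a complex argument. The same idea in a minimal C sketch (illustrative, not the original Fortran):

  #include <complex.h>
  #include <math.h>

  /* e^{i*phi} for a real phase phi, two equivalent ways */
  double complex phase_general(double phi) {
      return cexp(I * phi);              /* general complex exponential */
  }
  double complex phase_direct(double phi) {
      return cos(phi) + I * sin(phi);    /* Euler's formula: cheaper when phi is real */
  }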
New Reorganization - 2.3x faster

Old Pseudo-code:
  For all Shots                 !! Parallelizable
    For all Frequencies
      For all Depths
        For all References
          SSF
          FFT
          PHS
          FFT
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots

New Pseudo-code:
  For all Shots                 !! Parallelizable
    For all Frequencies
      For all Depths
        SSF
        FFT
        For all References
          PHS
          FFT
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots

Moving the SSF step and its FFT out of the reference loop executes them once per depth instead of once per reference.
Serial Optimization Test Results

Original Code:
  Freq Loop  ncalls=     48  time= 713.116 s  per = 14.857 s
  PHS=tag    ncalls= 205824  time= 451.979 s  per =  2.196 ms
  SSF=tag    ncalls= 205824  time= 215.695 s  per =  1.048 ms
  REF=tag    ncalls= 205824  time=  27.058 s  per =  0.132 ms

Subroutine Improvement:
  Freq Loop  ncalls=     48  time= 574.065 s  per = 11.959 s
  PHS=tag    ncalls= 205824  time= 453.302 s  per =  2.202 ms
  SSF=tag    ncalls= 205824  time=  75.414 s  per =  0.366 ms
  REF=tag    ncalls= 205824  time=  27.964 s  per =  0.136 ms

Reduce # of FFTs:
  Freq Loop  ncalls=     71  time= 472.296 s  per =  6.65 s
  PHS=tag    ncalls= 286343  time=  92.600 s  per =  1.46 ms
  SSF=tag    ncalls=  36281  time=  16.435 s  per =  0.380 ms
  REF=tag    ncalls= 286343  time=  35.492 s  per =  0.123 ms
  (69.5% of the remaining time is in FFTs: 4.62 s per frequency)
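The per-tag counters above come from instrumenting the code; the profiling tool itself is not named in the slides. A minimal sketch of how such ncalls/time/per statistics can be gathered, with a hypothetical tag_t type that is not the tool used here:

  #include <stdio.h>
  #include <omp.h>   /* omp_get_wtime() for wall-clock timing */

  /* Hypothetical tag-based timer: count calls and accumulate elapsed time. */
  typedef struct { const char *name; long ncalls; double total; } tag_t;

  static void tag_report(const tag_t *t) {
      printf("%s ncalls=%ld time=%.3f s per=%.3f ms\n",
             t->name, t->ncalls, t->total, 1e3 * t->total / t->ncalls);
  }

  int main(void) {
      tag_t ssf = { "SSF=tag", 0, 0.0 };
      for (int i = 0; i < 1000; i++) {
          double t0 = omp_get_wtime();
          /* ... timed section (e.g. the SSF correction) runs here ... */
          ssf.ncalls++;
          ssf.total += omp_get_wtime() - t0;
      }
      tag_report(&ssf);
      return 0;
  }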
CEES 157 2007
• More discussion on writing/designing efficient codes
• Introduction to programming tools
  • SCons, make, svn
• No seminars
• Expansion of OpenMP