CEES 157 2006
• 18 registered students
• Efficient serial programming
• Parallel programming
• Seminars by Geoscience HPC users
• Projects on optimizing/parallelizing code
Parallel programming
• Introduction to OpenMP
  • Loops, parallel environment
• Introduction to MPI
  • Parallel environment, passing data, comparisons to OpenMP (a minimal sketch contrasting the two follows)
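The two models the course covers differ mainly in how data is shared: OpenMP threads see one address space, while MPI ranks must exchange data explicitly. A minimal sketch contrasting them on an array sum; this is not from the course materials, and the function names and the sum example are illustrative only:

  /* OpenMP: shared-memory loop parallelism */
  #include <omp.h>
  double sum_omp(const double *x, int n) {
      double s = 0.0;
      /* threads sum disjoint chunks; the reduction combines partial sums */
      #pragma omp parallel for reduction(+:s)
      for (int i = 0; i < n; i++) s += x[i];
      return s;
  }

  /* MPI: distributed memory, data is passed explicitly */
  #include <mpi.h>
  double sum_mpi(const double *xlocal, int nlocal) {
      double s = 0.0, total = 0.0;
      for (int i = 0; i < nlocal; i++) s += xlocal[i];  /* each rank sums its own data */
      MPI_Allreduce(&s, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return total;
  }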
Projects
• 8 projects
• 5-7 were successes
• Projects were concentrated in imaging and fluid flow
• The main benefits were writing more efficient serial code and introducing OpenMP into existing code
Loop invariant

Old code:
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
      wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
    }
  }

New code:
  signdzw=-signn*dz*w;
  signdzwv0z=-signn*dz*w_v0_z;
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=signdzw/v_z[iy][ix]-signdzwv0z;
      wld_z[iy][ix]=wld_z[iy][ix]*cmplx(cos(phsft),sin(phsft));
    }
  }
Loop invariant

Old code:
  for(ix=1;ix<nx-1;ix++){
    url=u[ix-1];
    urr=u[ix+1];
    ror =cmplx(tr+b*cb[ix]*cb[ix],  a*ca[ix]);
    rorr=cmplx(tr+b*cb[ix]*cb[ix+1],a*ca[ix]);
    rorl=cmplx(tr+b*cb[ix]*cb[ix-1],a*ca[ix]);
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    rol=conjg(ror);
    ra[ix]=1.0-2.0*rol;
    rb[ix]  =cmplx(tr+b*cb[ix]*cb[ix+1],-a*ca[ix]);
    rc[ix-1]=cmplx(tr+b*cb[ix]*cb[ix-1],-a*ca[ix]);
  }
Loop invariant

New code:
  for(ix=1;ix<nx-1;ix++){
    tmpbcb=b*cb[ix];   /* computed once per iteration instead of three times */
    tmpaca=a*ca[ix];   /* computed once instead of four times */
    url=u[ix-1];
    urr=u[ix+1];
    ror =cmplx(tr+tmpbcb*cb[ix],  tmpaca);
    rorr=cmplx(tr+tmpbcb*cb[ix+1],tmpaca);
    rorl=cmplx(tr+tmpbcb*cb[ix-1],tmpaca);
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    rol=conjg(ror);
    ra[ix]=1.0-2.0*rol;
    rb[ix]  =conjg(rorr);
    rc[ix-1]=conjg(rorl);
  }
Functional Inlining

  for(ix=1;ix<nx-1;ix++){
    tmpbcb=b*cb[ix];
    tmpaca=a*ca[ix];
    url=u[ix-1];
    urr=u[ix+1];
    __real__ ror =tr+tmpbcb*cb[ix];   __imag__ ror =tmpaca;
    __real__ rorr=tr+tmpbcb*cb[ix+1]; __imag__ rorr=tmpaca;
    __real__ rorl=tr+tmpbcb*cb[ix-1]; __imag__ rorl=tmpaca;
    v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
    __real__ rol= __real__ ror;  __imag__ rol=-__imag__ ror;
    ra[ix]=1.0-2.0*rol;
    __real__ rb[ix]  = __real__ rorr; __imag__ rb[ix]  =-__imag__ rorr;
    __real__ rc[ix-1]= __real__ rorl; __imag__ rc[ix-1]=-__imag__ rorl;
  }
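__real__ and __imag__ are GNU C extensions that read or write the parts of a complex number in place, which is what lets the loop above avoid the cmplx() and conjg() calls entirely. A minimal self-contained demonstration, not from the original code:

  #include <complex.h>
  #include <stdio.h>

  int main(void) {
      double complex z;
      __real__ z = 3.0;    /* GNU extension: assign the real part directly */
      __imag__ z = -4.0;   /* and the imaginary part, no constructor call */
      printf("z = %f + %fi\n", creal(z), cimag(z));
      return 0;
  }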
Vectorizing

Old code:
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
      wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
    }
  }
Vectorizing

New code:
  signdzw=-signn*dz*w;
  signdzwv0z=-signn*dz*w_v0_z;
  for(iy=0;iy<ny;iy++){
    for(ix=0;ix<nx;ix++){
      phin[ix]=signdzw/v_z[iy][ix]-signdzwv0z;
    }
    vsCos(nx,phin,cosout);
    vsSin(nx,phin,sinout);
    for(ix=0;ix<nx;ix++){
      __real__ tmp_wld[iy][ix]=cosout[ix];
      __imag__ tmp_wld[iy][ix]=sinout[ix];
    }
    for(ix=0;ix<nx;ix++) wld_z[iy][ix]*=tmp_wld[iy][ix];
  }

vsCos() and vsSin() are Intel MKL vector math routines that evaluate cos and sin over a whole array in one call, which vectorizes far better than scalar math calls inside the loop.
Computation time
• Old code: 654 seconds
• New code: 220 seconds (~3x speedup)
Implementation choices
• Restrictions
  • Cartesian grids
  • Diagonal permeability tensors
  • Low-order accurate MFEM (mixed finite element method)
• Large memory need => OpenMP
  • Reservoir simulation models > 10^6 elements
• On Cartesian grids the MFEM solves for 3*Nx*Ny + Nx + Ny unknowns (worked check below)
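As a worked check of that count (one consistent reading: Nx*Ny cell unknowns plus the (Nx+1)*Ny and Nx*(Ny+1) edge unknowns of a lowest-order mixed method), the 70x70 grid used on the Validation slide below gives 3*70*70 + 70 + 70 = 14,840 unknowns.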
Parallelized code
• Initializations
  • Several large matrices initialized (for grid, system, …)

  #pragma omp parallel for private(i,j)
  for(i=0; i < (3*nx*ny+nx+ny); i++) {
    for(j=0; j < (3*nx*ny+nx+ny); j++) {
      tangent[i][j] = 0.;
    }
  }
Parallelized code
• Grid generation
  • Grid data structures created in loops

  #pragma omp parallel for private(i,j)
  for(j=0; j<ny; j++)      // Loop in y direction
  {
    for(i=0; i<nx; i++)    // Loop in x direction
    {
      blah... blah...
    }
  }
Parallelized code
• Loop over elements
  • Builds tangent, residual and solution vector
  • Many calls to shape function routines

  #pragma omp parallel for private(et,i_e,i_e_2)
  for (et=0; et<n_dof_i; et++)        // Loop over elements
  {
    for(i_e=0; i_e<4; i_e++)          // Double loop over edges
    {
      for(i_e_2=0; i_e_2<4; i_e_2++)
      {
        blah… blah…
      }
    }
  }
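One hazard the elided loop body must handle: neighboring elements share edges, so parallel accumulation into a shared tangent or residual can race. A minimal sketch of one standard fix, an atomic update; the arrays edge_dof and contrib are hypothetical stand-ins, not the course's data structures:

  #include <omp.h>

  /* Hypothetical sketch, not the course code: race-free assembly of a
     shared residual from a parallel element loop. */
  void assemble(double *residual, const int (*edge_dof)[4],
                const double (*contrib)[4], int n_elements) {
      int et, i_e;
      #pragma omp parallel for private(i_e)
      for (et = 0; et < n_elements; et++) {
          for (i_e = 0; i_e < 4; i_e++) {
              /* two elements can touch the same edge, so guard the update */
              #pragma omp atomic
              residual[edge_dof[et][i_e]] += contrib[et][i_e];
          }
      }
  }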
Results
  [figure not preserved]
Validation
• 70x70 grid
• 14840 unknowns
• Tangent: 220,225,600 terms (= 14840^2)
Performance
  [figure not preserved]
Old Pseudo-code

  For all Shots (~thousands)    !! Parallelizable
    For all Frequencies (~50-100)
      For all Depths (~500)
        For all References (~4-10)
          U=U*SSF
          FFT (U)
          U=U*PHS
          IFFT(U)
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots
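Since the outer loop is flagged parallelizable (every shot is independent), the natural OpenMP form puts the parallel directive at the shot level. A minimal sketch, assuming a hypothetical migrate_shot() that wraps the frequency/depth/reference loops:

  #include <omp.h>

  void migrate_shot(int ishot);   /* hypothetical: runs the inner loops for one shot */

  void migrate_all(int nshots) {
      int ishot;
      /* shots are independent, so the outermost loop parallelizes cleanly;
         dynamic scheduling balances shots of uneven cost */
      #pragma omp parallel for schedule(dynamic)
      for (ishot = 0; ishot < nshots; ishot++)
          migrate_shot(ishot);
  }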
Unnecessary complex math

  !! --------------------------------------------------------------
  !! SSF correction subroutine
  subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
    complex :: dax(:,:),ikz2d(:,:)
    real    :: w,dir,allfields(:,:,:),allrefs(:,:)
    integer :: iz,ir
    ikz2d = (cmplx(0.,1.)*allrefs(ir,3)*( allfields(:,:,3)-allrefs(ir,3) )) + &
            ( allrefs(ir,4)*w**2       *( allfields(:,:,4)-allrefs(ir,4) )) / &
            sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
    dax = dax * cexp( cmplx( -dz*abs(aimag(ikz2d)), -dz*real(ikz2d) ) )
  end subroutine wemig_ssf
  !! --------------------------------------------------------------
New Routine - 2.8x faster

  !! --------------------------------------------------------------
  !! SSF correction subroutine
  subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
    complex :: dax(:,:)
    real    :: w,dir,allfields(:,:,:),allrefs(:,:),ikz2d(:,:)
    integer :: iz,ir
    ikz2d = -dz*( allrefs(ir,4)*w**2*( allfields(:,:,4)-allrefs(ir,4) )) / &
            sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
    dax = dax * cmplx( cos(ikz2d), sin(ikz2d) )
  end subroutine wemig_ssf
  !! --------------------------------------------------------------
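The core of the 2.8x gain is that the phase is now a purely real array, so the complex exponential can be built from one cos and one sin instead of a general cexp of a complex argument. The same idea in a minimal C sketch (illustrative, not the original Fortran):

  #include <complex.h>
  #include <math.h>

  /* e^{i*phi} for a real phase phi, two equivalent ways */
  double complex phase_general(double phi) {
      return cexp(I * phi);              /* general complex exponential */
  }
  double complex phase_direct(double phi) {
      return cos(phi) + I * sin(phi);    /* Euler's formula: cheaper when phi is real */
  }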
New Reorganization - 2.3x faster

Old Pseudo-code:
  For all Shots                 !! Parallelizable
    For all Frequencies
      For all Depths
        For all References
          SSF
          FFT
          PHS
          FFT
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots

New Pseudo-code:
  For all Shots                 !! Parallelizable
    For all Frequencies
      For all Depths
        SSF
        FFT
        For all References
          PHS
          FFT
          Interpolation
        end References
      end Depths
    end Frequencies
  end Shots

Moving the SSF step and its FFT out of the reference loop executes them once per depth instead of once per reference.
Serial Optimization Test Results

Original Code:
  Freq Loop  ncalls=     48  time= 713.116 s  per = 14.857 s
  PHS=tag    ncalls= 205824  time= 451.979 s  per =  2.196 ms
  SSF=tag    ncalls= 205824  time= 215.695 s  per =  1.048 ms
  REF=tag    ncalls= 205824  time=  27.058 s  per =  0.132 ms

Subroutine Improvement:
  Freq Loop  ncalls=     48  time= 574.065 s  per = 11.959 s
  PHS=tag    ncalls= 205824  time= 453.302 s  per =  2.202 ms
  SSF=tag    ncalls= 205824  time=  75.414 s  per =  0.366 ms
  REF=tag    ncalls= 205824  time=  27.964 s  per =  0.136 ms

Reduce # of FFTs:
  Freq Loop  ncalls=     71  time= 472.296 s  per =  6.65 s
  PHS=tag    ncalls= 286343  time=  92.600 s  per =  1.46 ms
  SSF=tag    ncalls=  36281  time=  16.435 s  per =  0.380 ms
  REF=tag    ncalls= 286343  time=  35.492 s  per =  0.123 ms
  (69.5% of the remaining time is in FFTs: 4.62 s per frequency)
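The per-tag counters above come from instrumenting the code; the profiling tool itself is not named in the slides. A minimal sketch of how such ncalls/time/per statistics can be gathered, with a hypothetical tag_t type that is not the tool used here:

  #include <stdio.h>
  #include <omp.h>   /* omp_get_wtime() for wall-clock timing */

  /* Hypothetical tag-based timer: count calls and accumulate elapsed time. */
  typedef struct { const char *name; long ncalls; double total; } tag_t;

  static void tag_report(const tag_t *t) {
      printf("%s ncalls=%ld time=%.3f s per=%.3f ms\n",
             t->name, t->ncalls, t->total, 1e3 * t->total / t->ncalls);
  }

  int main(void) {
      tag_t ssf = { "SSF=tag", 0, 0.0 };
      for (int i = 0; i < 1000; i++) {
          double t0 = omp_get_wtime();
          /* ... timed section (e.g. the SSF correction) runs here ... */
          ssf.ncalls++;
          ssf.total += omp_get_wtime() - t0;
      }
      tag_report(&ssf);
      return 0;
  }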
CEES 157 2007
• More discussion on writing/designing efficient codes
• Introduction to programming tools
  • SCons, make, svn
• No seminars
• Expansion of OpenMP