
High Performance Computing Code for Pyroclastic Flows Simulations


Presentation Transcript


  1. High Performance Computing Code for Pyroclastic Flows Simulations. Carlo Cavazzoni, Giovanni Erbacci, Silvia Baseggio (High Performance Computing Group – CINECA); Tomaso Esposti Ongaro, Augusto Neri (INGV)

  2. Outline • Pyroclastic Dispersion Analysis Code (PDAC) • Fortran90 implementation • Parallel implementation • 3D simulations • Performance and optimizations • Conclusions

  3. EXPLORIS Explosive Eruption Risk and Decision Support for EU Populations Threatened by Volcanoes http://exploris.pi.ingv.it/

  4. Pyroclastic Flows: Montserrat, Mount St. Helens, Vesuvius

  5. Fluid-dynamic simulation of a multiphase, multicomponent compressible fluid: rectangular non-uniform grid, topography, gas and particle fields

  6. Time evolution: s(t+Δt) = F( s(t), s(t-Δt), … ). The system status at time t+Δt is a function of the system at previous time values t, t-Δt, etc. Usually a first-order scheme is used, with dependence only on s(t).
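
As a minimal illustration of such a first-order scheme (the program and the names state, advance, tstop are made up for this sketch and are not PDAC variable names), the time evolution amounts to a loop of the form:

PROGRAM time_marching
  ! Minimal sketch of a first-order time-marching loop (illustrative only).
  IMPLICIT NONE
  REAL*8 :: t, dt, tstop
  REAL*8 :: state(100)             ! s(t): the discrete system status

  t = 0.D0; dt = 1.D-2; tstop = 1.D0
  state = 0.D0                     ! initial condition s(0)

  DO WHILE ( t < tstop )
    CALL advance( state, dt )      ! s(t+dt) = F( s(t) ): first order, no s(t-dt)
    t = t + dt
  END DO

CONTAINS

  SUBROUTINE advance( s, dt )
    REAL*8, INTENT(INOUT) :: s(:)
    REAL*8, INTENT(IN)    :: dt
    s = s + dt * 1.D0              ! placeholder for the real right-hand side F
  END SUBROUTINE advance

END PROGRAM time_marching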

  7. PDAC - Simplified Code Flow Chart
  (Flow chart boxes: start, read input, setup, compute physical quantities at time t, compute explicit fluxes, t = t + Δt, iterative solver, compute enthalpy, print output, a "t = tstop ?" decision with yes/no branches, stop.)
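
One plausible reading of the flow chart as F90 pseudocode (the routine names are illustrative placeholders, not the actual PDAC subroutines, and the exact ordering inside the loop is an assumption):

! Schematic main loop corresponding to the flow chart above (sketch only).
CALL read_input
CALL setup
time_loop: DO
  CALL compute_physical_quantities    ! physical quantities at time t
  CALL compute_explicit_fluxes
  t = t + dt
  CALL iterative_solver               ! iterative solver step
  CALL compute_enthalpy
  CALL print_output
  IF ( t >= tstop ) EXIT time_loop    ! the "t = tstop ?" decision box
END DO time_loop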

  8. Code Restructuring
  • Code rewritten using F90 with a modular approach: gets rid of the complexity, more general and readable code; drawback: less performance
  • Parallelization: use many CPUs, code ported to many architectures
  • Optimizations: recover F77 efficiency
  • From 2D to 3D

  9. Use of F90 / Modular Code • Better control over data structures • Easier to add new features / extensions • More readable • Easier to maintain • Lower performance compared to F77
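
For instance (a made-up minimal illustration, not actual PDAC source), a COMMON block shared through an include file can be replaced by a module, with each routine importing only what it needs:

! Illustrative example: replacing an F77 COMMON block with an F90 module.
! The names grid, dr, dz echo the listings on the next slides, but this is
! not the real PDAC code.
MODULE grid
  IMPLICIT NONE
  REAL*8, ALLOCATABLE :: dr(:), dz(:)   ! e.g. previously shared via the 'pdac2d.com' include file
END MODULE grid

SUBROUTINE example
  USE grid, ONLY: dr, dz                ! explicit, compiler-checked access
  IMPLICIT NONE
  ! ... work with dr, dz ...
END SUBROUTINE example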

  10. Typical F77 Subroutine: note the unreadable computed GOTOs! It is hard to understand what the code does.

      SUBROUTINE BETAS
C
      INCLUDE 'pdac2d.com'
      REAL*8 LEFT
C
      PARAMETER (DELG=1.D-8)
      DO 10 J=2,JB1
      DO 10 I=2,IB1
      IJ=I+(J-1)*IB2
      IF(FL(IJ).NE.1) GOTO 10
      CALL SUBSC
      DRP=DR(I)+DR(I+1)
      DRM=DR(I)+DR(I-1)
      DZP=DZ(J)+DZ(J+1)
      DZM=DZ(J)+DZ(J-1)
      RAGS=ROG(IJ)/P(IJ)
      RIGHT=1.D0
      LEFT=1.D0
      TOP=1.D0
      BOT=1.D0
      NFLR=FL(IPJ)
      NFLL=FL(IMJ)
      NFLT=FL(IJP)
      NFLB=FL(IJM)
      GOTO (2,1,1,2,1),NFLR
    1 RIGHT=0.D0
    2 GOTO (4,3,3,4,3),NFLL
    3 LEFT=0.D0
    4 GOTO (6,5,5,6,5),NFLT
    5 TOP=0.D0
    6 GOTO (8,7,7,8,7),NFLB
    7 BOT=0.D0
    8 CONTINUE
      CONV(IJ)=DELG*RGP(IJ)
      RBETA=EP(IJ)*RAGS+DT*DT/DZ(J)
     $ *((DZ(J+1)*EP(IJ)+DZ(J)*EP(IJT))/DZP/DZP*2.D0*TOP+
     $ (DZ(J-1)*EP(IJ)+DZ(J)*EP(IJB))/DZM/DZM*2.D0*BOT)+
     $ DT*DT/R(I)/DR(I)
     $ *(RB(I)*(DR(I+1)*EP(IJ)+DR(I)*EP(IJR))/DRP/DRP*2.D0*RIGHT
     $ +RB(I-1)*(DR(I-1)*EP(IJ)+DR(I)*EP(IJL))/DRM/DRM*2.D0*LEFT)
      ABETA(IJ)=1.D0/RBETA
   10 CONTINUE
C
      RETURN
      END

  11. Typical F90 Subroutine: INCLUDE commons replaced by USE; GOTOs replaced by structured IF statements. The subroutine is longer, but by far more readable.

SUBROUTINE betas
!
  USE grid, ONLY: fl
  USE dimensions
  USE eosg_module, ONLY: rags
  USE gas_solid_density, ONLY: rog, rgp
  USE grid, ONLY: r, rb, dr, dz, ib2, ib1, jb1
  USE pressure_epsilon, ONLY: p, ep
  USE subsc_module, ONLY: subsc, ipj, imj, ijp, ijm, ijr, ijl, ijt, ijb
  USE time_parameters, ONLY: dt, time

  IMPLICIT NONE
!
  REAL*8 :: rbtop, rbleft, rbright, rbbot, rbeta
  REAL*8 :: dt2r, dt2z, dzp, drm, drp, dzm
  REAL*8 :: indrp, indrm, indzp, indzm
  INTEGER :: nflb, nflt, nfll, nflr
  INTEGER :: i, j, ij
!
  REAL*8, PARAMETER :: delg=1.D-8

  DO j=2,jb1
    DO i=2,ib1
      ij=i+(j-1)*ib2
      IF(fl(ij).EQ.1) THEN
        CALL subsc(ij)
        drp=dr(i)+dr(i+1)
        drm=dr(i)+dr(i-1)
        dzp=dz(j)+dz(j+1)
        dzm=dz(j)+dz(j-1)
!
        indrp=1.D0/drp
        indrm=1.D0/drm
        indzp=1.D0/dzp
        indzm=1.D0/dzm
        nflr=fl(ipj)
        nfll=fl(imj)
        nflt=fl(ijp)
        nflb=fl(ijm)
        IF( (nflr .NE. 1) .AND. (nflr .NE. 4) ) THEN
          rbright = 0.D0
        ELSE
          rbright = ( dr(i+1)*ep(ij)+dr(i)*ep(ijr) )*indrp*indrp*2.D0
        END IF
        IF( (nfll .NE. 1) .AND. (nfll .NE. 4) ) THEN
          rbleft = 0.D0
        ELSE
          rbleft = ( dr(i-1)*ep(ij)+dr(i)*ep(ijl) )*indrm*indrm*2.D0
        END IF
        IF( (nflt .NE. 1) .AND. (nflt .NE. 4) ) THEN
          rbtop = 0.0D0
        ELSE
          rbtop = ( dz(j+1)*ep(ij)+dz(j)*ep(ijt) )*indzp*indzp*2.D0
        END IF
        IF( (nflb .NE. 1) .AND. (nflb .NE. 4) ) THEN
          rbbot = 0.D0
        ELSE
          rbbot = ( dz(j-1)*ep(ij)+dz(j)*ep(ijb) )*indzm*indzm*2.D0
        END IF
        conv(ij) = delg * rgp(ij)
        dt2z = dt * dt * indz(j)
        dt2r = dt * dt * inr(i) * indr(i)
        rags = rog(ij) / p(ij)
        rbeta = ep(ij) * rags +                               &
                dt2z * ( rbtop + rbbot ) +                    &
                dt2r * ( rb(i) * rbright + rb(i-1) * rbleft )
        abeta(ij) = 1.D0 / rbeta
      END IF
    END DO
  END DO

  RETURN
END SUBROUTINE

  12. Parallel Code • Domain decomposition / MPI • Control/mapping data structures • Hiding of communication controls • Code I/O redesigned • Drawback: communication overhead and some performance loss

  13. Parallelization Strategies. The main strategy is to define sub-grids and distribute them to the available processors; intermediate results on local data need to be exchanged across the domain boundaries to solve the equations defined on the global domain. Block-like distribution: less overall communication. Layer-like distribution: communication only with nearest neighbours.
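
A minimal sketch of the boundary exchange this implies, assuming a layer-like (1D) decomposition along z with one ghost layer per side and using only standard MPI calls; the subroutine name, array layout and argument names are invented for illustration and are not taken from PDAC:

SUBROUTINE exchange_halo( field, nx, nzloc, comm )
  ! Illustrative halo exchange for a layer-like domain decomposition.
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: nx, nzloc, comm
  REAL*8,  INTENT(INOUT) :: field(nx, 0:nzloc+1)   ! one ghost layer per side
  INTEGER :: rank, nproc, up, down, ierr
  INTEGER :: status(MPI_STATUS_SIZE)

  CALL MPI_Comm_rank( comm, rank, ierr )
  CALL MPI_Comm_size( comm, nproc, ierr )
  up   = rank + 1
  IF ( up == nproc ) up = MPI_PROC_NULL
  down = rank - 1
  IF ( down < 0 ) down = MPI_PROC_NULL

  ! send the top physical layer up, receive into the bottom ghost layer
  CALL MPI_Sendrecv( field(1,nzloc), nx, MPI_DOUBLE_PRECISION, up,   0,  &
                     field(1,0),     nx, MPI_DOUBLE_PRECISION, down, 0,  &
                     comm, status, ierr )
  ! send the bottom physical layer down, receive into the top ghost layer
  CALL MPI_Sendrecv( field(1,1),       nx, MPI_DOUBLE_PRECISION, down, 1, &
                     field(1,nzloc+1), nx, MPI_DOUBLE_PRECISION, up,   1, &
                     comm, status, ierr )
END SUBROUTINE exchange_halo

A block-like distribution exchanges smaller faces (less overall communication volume), at the price of having more neighbours per processor than the layer-like scheme sketched here.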

  14. Efficiency of the 2D Parallel Code: coarse grid = 45 × 35 cells, fine grid = 225 × 150 cells
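
For reference, the quantities usually reported here follow the standard definitions (not spelled out on the slide itself): speed-up S(N) = T(1) / T(N) and parallel efficiency E(N) = S(N) / N, where T(N) is the wall-clock time of the same run on N processors. On the coarser grid the efficiency is expected to drop earlier, because the boundary data exchanged per processor becomes comparable to the local computation.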

  15. From 2D to 3D • Extension (2D->3D) of data structures and indexes • Subroutines and modules generalized • 2D simulations maintained in the 3D code
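
As a concrete, illustrative example of this extension (the 2D formula ij = i + (j-1)*ib2 appears in the F77 listing above; the 3D analogue and the helper function below are assumptions, not the PDAC implementation):

! Illustrative only: how a 2D linear cell index generalizes to 3D.
! 2D:  ij  = i + (j-1)*nx
! 3D:  ijk = i + (j-1)*nx + (k-1)*nx*ny
INTEGER FUNCTION cell_index_3d( i, j, k, nx, ny )
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: i, j, k, nx, ny
  cell_index_3d = i + (j-1)*nx + (k-1)*nx*ny
END FUNCTION cell_index_3d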

  16. Complexity increases (e.g. the cell-neighbour indexes) going from 2D serial to 2D parallel to 3D parallel code.

2D serial:

!
ipj=ij+1
ijp=ij+ib2
imj=ij-1
ijm=ij-ib2
imjm=ijm-1
ipjm=ipj-ib2
ipjp=ijp+1
imjp=imj+ib2
!
ijr=inds(ij,1)
ijl=inds(ij,2)
ijt=inds(ij,3)
ijb=inds(ij,4)
ijtr=inds(ij,5)
ijtl=inds(ij,6)
ijbr=inds(ij,7)
ijrr=inds(ij,8)
ijtt=inds(ij,9)

2D parallel:

ijkm  = myijk( ip0_jp0_km1_, ijk )
imjk  = myijk( im1_jp0_kp0_, ijk )
ipjk  = myijk( ip1_jp0_kp0_, ijk )
ijkp  = myijk( ip0_jp0_kp1_, ijk )
ipjkm = myijk( ip1_jp0_km1_, ijk )
ipjkp = myijk( ip1_jp0_kp1_, ijk )
imjkm = myijk( im1_jp0_km1_, ijk )
imjkp = myijk( im1_jp0_kp1_, ijk )
ijkpp = myijk( ip0_jp0_kp2_, ijk )
ippjk = myijk( ip2_jp0_kp0_, ijk )
immjk = myijk( im2_jp0_kp0_, ijk )
ijkmm = myijk( ip0_jp0_km2_, ijk )
ijke  = myinds( ip1_jp0_kp0_, ijk )
ijkt  = myinds( ip0_jp0_kp1_, ijk )
ijkw  = myinds( im1_jp0_kp0_, ijk )
ijkb  = myinds( ip0_jp0_km1_, ijk )
ijket = myinds( ip1_jp0_kp1_, ijk )
ijkwt = myinds( im1_jp0_kp1_, ijk )
ijkeb = myinds( ip1_jp0_km1_, ijk )
ijkwb = myinds( im1_jp0_km1_, ijk )
ijkee = myinds( ip2_jp0_kp0_, ijk )
ijktt = myinds( ip0_jp0_kp2_, ijk )
ijkww = myinds( im2_jp0_kp0_, ijk )
ijkbb = myinds( ip0_jp0_km2_, ijk )

3D parallel:

ipjk  = myijk( ip1_jp0_kp0_ , ijk )
imjk  = myijk( im1_jp0_kp0_ , ijk )
ippjk = myijk( ip2_jp0_kp0_ , ijk )
immjk = myijk( im2_jp0_kp0_ , ijk )
ijpk  = myijk( ip0_jp1_kp0_ , ijk )
ipjpk = myijk( ip1_jp1_kp0_ , ijk )
imjpk = myijk( im1_jp1_kp0_ , ijk )
ijmk  = myijk( ip0_jm1_kp0_ , ijk )
ipjmk = myijk( ip1_jm1_kp0_ , ijk )
imjmk = myijk( im1_jm1_kp0_ , ijk )
ijppk = myijk( ip0_jp2_kp0_ , ijk )
ijmmk = myijk( ip0_jm2_kp0_ , ijk )
ijkp  = myijk( ip0_jp0_kp1_ , ijk )
ipjkp = myijk( ip1_jp0_kp1_ , ijk )
imjkp = myijk( im1_jp0_kp1_ , ijk )
ijpkp = myijk( ip0_jp1_kp1_ , ijk )
ijmkp = myijk( ip0_jm1_kp1_ , ijk )
ijkm  = myijk( ip0_jp0_km1_ , ijk )
ipjkm = myijk( ip1_jp0_km1_ , ijk )
imjkm = myijk( im1_jp0_km1_ , ijk )
ijpkm = myijk( ip0_jp1_km1_ , ijk )
ijmkm = myijk( ip0_jm1_km1_ , ijk )
ijkpp = myijk( ip0_jp0_kp2_ , ijk )
ijkmm = myijk( ip0_jp0_km2_ , ijk )
ijke  = myinds( ip1_jp0_kp0_ , ijk )
ijkw  = myinds( im1_jp0_kp0_ , ijk )
ijkee = myinds( ip2_jp0_kp0_ , ijk )
ijkww = myinds( im2_jp0_kp0_ , ijk )
ijkn  = myinds( ip0_jp1_kp0_ , ijk )
ijken = myinds( ip1_jp1_kp0_ , ijk )
ijkwn = myinds( im1_jp1_kp0_ , ijk )
ijks  = myinds( ip0_jm1_kp0_ , ijk )
ijkes = myinds( ip1_jm1_kp0_ , ijk )
ijkws = myinds( im1_jm1_kp0_ , ijk )
ijknn = myinds( ip0_jp2_kp0_ , ijk )
ijkss = myinds( ip0_jm2_kp0_ , ijk )
ijkt  = myinds( ip0_jp0_kp1_ , ijk )
ijket = myinds( ip1_jp0_kp1_ , ijk )
ijkwt = myinds( im1_jp0_kp1_ , ijk )
ijknt = myinds( ip0_jp1_kp1_ , ijk )
ijkst = myinds( ip0_jm1_kp1_ , ijk )
ijkb  = myinds( ip0_jp0_km1_ , ijk )
ijkeb = myinds( ip1_jp0_km1_ , ijk )
ijkwb = myinds( im1_jp0_km1_ , ijk )
ijknb = myinds( ip0_jp1_km1_ , ijk )
ijksb = myinds( ip0_jm1_km1_ , ijk )
ijktt = myinds( ip0_jp0_kp2_ , ijk )
ijkbb = myinds( ip0_jp0_km2_ , ijk )

(Figure: 2D computational grid stencil showing the cells imj, ij, ipj and ijm, ijp.)

  17. Code optimizations: recover the performance of the original F77 code.
  (Bar chart: execution time relative to the F77 code, scale 0 to 2 from "faster" to "slower", across the code evolution: F77 serial, F90 serial, 2D parallel, 3D parallel, unrolling, memory access, parameter-dependent subroutines.)

  18. Example of optimization: loop unrolling and parameter-dependent code.

General version (any number of phases):

DO l=1,nphase
  IF(au1(l,l) /= 0.D0) THEN
    lp1=l+1
    div=1.D0/au1(l,l)
    DO lj=lp1,nphase
      au1(l,lj)=au1(l,lj)*div
    END DO
    bu1(l)=bu1(l)*div
    au1(l,l)=0.D0
    DO li=1,nphase
      amul=au1(li,l)
      DO lj=lp1,nphase
        au1(li,lj)=au1(li,lj)-amul*au1(l,lj)
      END DO
      bu1(li)=bu1(li)-amul*bu1(l)
    END DO
  END IF
END DO

Unrolled version for three phases:

IF( a(1,1) /= 0.D0 ) THEN
  div = 1.D0 / a(1,1)
  a(1,2) = a(1,2) * div
  a(1,3) = a(1,3) * div
  b(1) = b(1) * div
  a(1,1) = 0.D0
  !li=2
  amul = a(2,1)
  a(2,2) = a(2,2) - amul * a(1,2)
  a(2,3) = a(2,3) - amul * a(1,3)
  b(2) = b(2) - amul * b(1)
  !li=3
  amul = a(3,1)
  a(3,2) = a(3,2) - amul * a(1,2)
  a(3,3) = a(3,3) - amul * a(1,3)
  b(3) = b(3) - amul * b(1)
END IF
IF( a(2,2) /= 0.D0 ) THEN
  div=1.D0/a(2,2)
  a(2,3)=a(2,3)*div
  b(2)=b(2)*div
  a(2,2)=0.D0
  !li=1
  amul=a(1,2)
  a(1,3)=a(1,3)-amul*a(2,3)
  b(1)=b(1)-amul*b(2)
  !li=3
  amul=a(3,2)
  a(3,3)=a(3,3)-amul*a(2,3)
  b(3)=b(3)-amul*b(2)
END IF
IF( a(3,3) /= 0.D0 ) THEN
  div=1.D0/a(3,3)
  b(3)=b(3)*div
  a(3,3)=0.D0
  !li=1
  amul=a(1,3)
  b(1)=b(1)-amul*b(3)
  !li=2
  amul=a(2,3)
  b(2)=b(2)-amul*b(3)
END IF
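
One way to read "parameter-dependent code" (the dispatch below is an illustrative guess, not the actual PDAC source; solve_3x3 and solve_general are hypothetical wrappers around the two versions above) is that the hand-unrolled kernel is selected when the number of phases matches the unrolled case:

SUBROUTINE solve_dispatch( au1, bu1, nphase )
  ! Illustrative sketch of dispatching to a hand-unrolled kernel.
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: nphase
  REAL*8,  INTENT(INOUT) :: au1(nphase,nphase), bu1(nphase)

  SELECT CASE ( nphase )
  CASE ( 3 )
    CALL solve_3x3( au1, bu1 )              ! unrolled version, no inner loops
  CASE DEFAULT
    CALL solve_general( au1, bu1, nphase )  ! generic loop version
  END SELECT
END SUBROUTINE solve_dispatch

The unrolled version removes loop overhead and lets the compiler keep the small system in registers, which is presumably part of the gain shown on the previous slide.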

  19. We gain a factor of 2!

  20. Parallel Performance. Computational grid: 90x90x70.
  (Two speed-up vs. number-of-processors plots, 1 to 16 processors, each with the ideal speed-up line: PDAC before optimizations and PDAC after optimizations.)
  No parallel performance loss due to the optimizations.

  21. 3D Simulation Run
  • Fine grid: 256³ cells
  • 1 simulation step: 30 sec on 64 CPUs
  • 30 minutes of real simulation time: 180,000 steps
  • Full 3D simulation of one eruption scenario: 60 days on 64 CPUs
  • Code after optimization and tuning: 30 days on 64 CPUs
  • Next computer: 7 days on 64 CPUs (computer power doubles every 18 months, Moore's law)
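
A quick consistency check of these figures: 180,000 steps × 30 s/step = 5.4 × 10^6 s, i.e. about 62 days of wall-clock time on 64 CPUs, which matches the quoted 60 days for a full eruption scenario; the 7-day estimate on the next computer then corresponds to roughly two Moore's-law doublings of the optimized 30-day figure.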

  22. Conclusion. I'm really excited about this project, and before the end of this year we will run the first 3D simulation of an eruption scenario. (Images: real and synthetic Vesuvius.)
