Eugene Mirvis, RAL/NCAR@EMC NOAA. BUILDING FEE/FSE ENVIRONMENT @ JNT/DTC/DET NEMS COMPONENTS PORTING AND AVAILABILITY on “JET” and “ Bluefire ”. 12/23/2011. 1. DTC/DET Environment Components on “JET” & “ Bluefire ”. Porting :: FEE :: Availability.
Eugene Mirvis, RAL/NCAR@EMC NOAA
BUILDING FEE/FSE ENVIRONMENT @ JNT/DTC/DET
NEMS COMPONENTS PORTING AND AVAILABILITY on “JET” and “Bluefire”
12/23/2011
Porting :: FEE :: Availability
• Sufficient computing and storage resources
• Compilers
• Data availability
• Ported apps
• Source code and update capabilities
• Regression testing
• Data and jobs re-scripting
• 3rd-party libraries availability
• Version control repository, sources
• Build system scripting
• GSI data, sufficient networking
• Sufficient TMP and HIST space
• Flexible modular environment
• Gridded NAM, GFS, snow, ice, SST
WORKING ENVIRONMENT ON JET :: .cshrc
#!/bin/tcsh
module switch icc/11.1.072 icc/11.1.073
module switch intel/11.1.072 intel/11.1.073
module switch mvapich2/1.4.1-intel-11.1 mvapich2/1.6-intel-11.1
setenv SVN_EDITOR nano
alias wdir "cd /lfs0/projects/dtc/NEMS"
alias ldata "cd /mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data"
setenv PATH ${PATH}:/opt/mvapich2/1.4.1-intel-11.1/include
setenv ESMF_DIR /mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data/NEMSlibs/esmf/esmf_3_1_0rp5m2g
setenv DIR_ESMF /mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data/NEMSlibs/esmf/esmf_3_1_0rp5m2g
setenv ESMF_COMM mpich2
setenv ESMF_OPT O
setenv DIR_NETCDF "$NETCDF/lib/"
##setenv ESMF_OPT g
setenv BASEDIR /mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data/NEMSlibs
setenv NEMSdata /mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data
###setenv ESMF_MPIRUN "mpirun_rsh -np 2 h1 h1 "
WORKING ENVIRONMENT ON :: BLUEFIRE
#!/bin/tcsh
source /contrib/Modules/3.2.6/init/csh
module load makedepf90
module load grads
module load svn-1.5
module load netcdf/3.6.2_deprecated
module switch xlf/12.01.0000.0005.091127 xlf/12.01.0000.0003
alias wdir "cd /glade/proj2/ral/RJNTB/NEMS/"
#module switch icc/11.1.072 icc/11.1.073
#module switch intel/11.1.072 intel/11.1.073
#module switch mvapich2/1.4.1-intel-11.1 mvapich2/1.6-intel-11.1
setenv SVN_EDITOR nano
setenv NEMS "/glade/proj2/ral/RJNTB/NEMS"
alias wdir "cd ${NEMS}"
alias ldata "cd ${NEMS}/NEMS_libs_data"
setenv NEMSdata "${NEMS}/NEMS_libs_data"
setenv NEMSlibs "${NEMSdata}/NEMSlibs"
#setenv PATH ${PATH}:/opt/mvapich2/1.4.1-intel-11.1/include
setenv ESMF_DIR ${NEMSlibs}/esmf/esmf_3_1_0rp5m3
setenv DIR_ESMF ${NEMSlibs}/esmf/esmf_3_1_0rp5m3
#setenv ESMF_COMM mpich2
setenv INC_ESMF ${DIR_ESMF}/mod/modO/AIX.default.64.mpi.default
setenv ESMF_OPT O
setenv DIR_NETCDF "$NETCDF/lib/"
##setenv ESMF_OPT g
setenv BASEDIR $DIR_NETCDF
###setenv ESMF_MPIRUN "mpirun_rsh -np 2 h1 h1 "
setenv LOGIN emirvis
setenv PATH ${PATH}:/usr/local/bin/
setenv NWPROD $NEMSdata/nwprod
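Before building, it can help to confirm that the shell really exports the variables these startup files are meant to set. The following is a minimal sketch, not part of the original setup: the function names are hypothetical, and the list contains only the variables that appear in both the JET and the Bluefire files above.

```python
import os

# Variables set in both the JET and the Bluefire startup files on the slides.
REQUIRED_VARS = ["ESMF_DIR", "DIR_ESMF", "ESMF_OPT", "DIR_NETCDF", "BASEDIR", "NEMSdata"]


def missing_vars(env, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


def nonexistent_dirs(env, names=("ESMF_DIR", "DIR_ESMF")):
    """Return (name, path) pairs whose value is set but is not a directory."""
    return [(n, env[n]) for n in names if env.get(n) and not os.path.isdir(env[n])]


if __name__ == "__main__":
    # Report problems in the current shell environment.
    for name in missing_vars(os.environ):
        print(f"missing: {name}")
    for name, path in nonexistent_dirs(os.environ):
        print(f"not a directory: {name}={path}")
```

Run it from a fresh login shell on either machine; no output means every listed variable is set and the two ESMF paths exist.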
Where is everything on JET?
[emirvis@fe2 NEMS]$ pwd
/mnt/lfs0/projects/dtc/NEMS
[emirvis@fe2 NEMS_libs_data]$ pwd
/mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data
Where are all the data and libraries on JET?
[emirvis@fe2 NEMS_libs_data]$ ls -trla
total 44
drwxr-sr-x 3 emirvis dtc 4096 Sep 18 16:23 stmp
drwxr-sr-x 5 emirvis dtc 4096 Sep 29 21:04 nwprod
drwxr-sr-x 3 emirvis dtc 4096 Oct 11 21:08 RTdata
drwxr-sr-x 22 emirvis dtc 4096 Nov 1 18:47 REGRESSION_TEST
drwxr-sr-x 27 emirvis dtc 4096 Nov 17 02:21 ..
drwxr-sr-x 35 emirvis dtc 4096 Dec 6 00:11 ptmp
drwxr-sr-x 9 emirvis dtc 4096 Dec 14 23:39 NEMSlibs
drwxr-sr-x 11 emirvis dtc 4096 Dec 14 23:39 .
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 23:40 LIB
drwxr-sr-x 7 emirvis dtc 4096 Dec 14 23:40 INCMOD
Bluefire: /glade/proj2/ral/RJNTB/NEMS
The version that has been running:
/glade/proj2/ral/RJNTB/NEMS/14601/
/glade/proj2/ral/RJNTB/NEMS/14601/job/regression_tests
The version from NCO (I have not been able to run it yet):
/glade/proj2/ral/RJNTB/NEMS/nam_nems_nmmb_fcst.fd_em
LIBS+DATA:
/glade/proj2/ral/RJNTB/NEMS/NEMS_libs_data
My Bluefire run results:
/glade/proj2/ral/RJNTB/NEMS/NEMS_libs_data/ptmp/RT_111/NMM_CNTRL
JET: CODE/SCRIPTS :: /mnt/lfs0/projects/dtc/NEMS/
drwx--S--- 7 emirvis dtc 4096 Feb 11 2011 12204
drwxr-sr-x 5 emirvis dtc 4096 Mar 31 2011 lib
drwx--S--- 7 emirvis dtc 4096 Apr 28 2011 13612
drwx--S--- 7 emirvis dtc 4096 Jun 1 2011 13537
drwxr-sr-x 7 emirvis dtc 4096 Jun 16 22:46 13537wrk
drwxr-sr-x 7 emirvis dtc 4096 Jul 13 19:16 10566_em
drwx--S--- 7 emirvis dtc 4096 Aug 18 23:51 13537wrk2
drwxr-sr-x 7 emirvis dtc 4096 Sep 8 12:46 15367
drwxr-sr-x 7 emirvis dtc 4096 Sep 8 12:49 13537wrk_cln
drwx--S--- 7 emirvis dtc 4096 Oct 6 15:52 14601nco
drwxrwsr-x 5 ligia dtc 4096 Oct 20 16:57 ..
drwxr-xr-x 7 emirvis dtc 4096 Oct 21 20:35 14601nco2
drwxr-sr-x 7 emirvis dtc 4096 Nov 4 19:37 15555
drwxr-sr-x 7 emirvis dtc 4096 Nov 7 23:23 16000
drwxr-sr-x 7 emirvis dtc 4096 Nov 8 20:16 16148
drwxr-sr-x 7 emirvis dtc 4096 Nov 9 04:34 16149
drwxr-sr-x 6 emirvis dtc 4096 Nov 9 05:01 good1
drwxr-sr-x 7 emirvis dtc 4096 Nov 9 20:41 14601nco2.11.3
drwxr-sr-x 7 emirvis dtc 4096 Nov 10 12:29 NEMS-15555
drwxr-sr-x 7 emirvis dtc 4096 Nov 10 14:38 NEMS-NCO_ops-2011
drwxr-sr-x 7 emirvis dtc 4096 Nov 17 02:22 dtrunk_jet
drwxr-sr-x 11 emirvis dtc 4096 Dec 14 23:39 NEMS_libs_data
drwxr-sr-x 2 emirvis dtc 4096 Dec 15 09:11 TARS
drwxr-sr-x 7 emirvis dtc 4096 Dec 15 09:29 dEM
Everything you need should now be available at: /mnt/lfs0/projects/dtc/NEMS/TARS (see next)
To compile:
• cd $NEMS_DIR/src
• gmake clean
• gmake nmm
Other options (choose one):
• gmake nmm_gfs --- for both NMM and GFS cores, without GOCART
• gmake gen --- for GEN core only
• gmake nmm_gfs_gen_post GOCART_MODE=full --- for GEN, NMM and GFS cores, with GOCART and post
• gmake nmm_gfs GOCART_MODE=full --- for both NMM and GFS cores, with GOCART
• gmake nmm --- for NMM core only
• gmake nmm_post --- for NMM core only, with post
• gmake gfs --- for GFS core only, without GOCART
• gmake gfs_post --- for GFS core only, with post, without GOCART
• gmake gfs GOCART_MODE=full --- for GFS core only, with GOCART
• gmake fim --- for FIM core only
• gmake fim_gfs --- for both FIM and GFS cores, without GOCART
• gmake fim_gfs GOCART_MODE=full --- for both FIM and GFS cores, with GOCART
Where to get the tar files: /mnt/lfs0/projects/dtc/NEMS/TARS
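The target list above can be captured in a small lookup so that a build wrapper only ever issues a combination that the slide actually names. This is a sketch only: gmake_command is a hypothetical helper, and it deliberately rejects combinations the slide does not show, even though the real makefile may accept them.

```python
# (target, with_gocart) pairs exactly as they appear on the slide.
LISTED_BUILDS = {
    ("nmm", False), ("nmm_post", False),
    ("nmm_gfs", False), ("nmm_gfs", True),
    ("nmm_gfs_gen_post", True),
    ("gen", False),
    ("gfs", False), ("gfs_post", False), ("gfs", True),
    ("fim", False),
    ("fim_gfs", False), ("fim_gfs", True),
}


def gmake_command(target, gocart=False):
    """Return the gmake argument list for a build named on the slide."""
    if (target, gocart) not in LISTED_BUILDS:
        raise ValueError(f"combination not listed on the slide: {target}, gocart={gocart}")
    cmd = ["gmake", target]
    if gocart:
        # GOCART is switched on through a make variable on the command line.
        cmd.append("GOCART_MODE=full")
    return cmd


print(" ".join(gmake_command("nmm")))
print(" ".join(gmake_command("nmm_gfs", gocart=True)))
```

The returned list can be passed to subprocess.run from $NEMS_DIR/src after a gmake clean.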
[emirvis@fe3 TARS]$ ls -trla
total 3491916
drwxr-sr-x 25 emirvis dtc 4096 Dec 15 19:39 ..
-rw-r--r-- 1 emirvis dtc 172963840 Dec 15 20:30 16149.tar
-rw-r--r-- 1 emirvis dtc 112517120 Dec 15 20:33 NEMS-15555.tar
-rw-r--r-- 1 emirvis dtc 287057920 Dec 15 20:53 NEMS-NCO_ops-2011.tar
drwxr-sr-x 5 emirvis dtc 4096 Dec 22 15:55 nwprod
drwxr-sr-x 3 emirvis dtc 4096 Dec 22 15:56 RTdata
-rw-r--r-- 1 emirvis dtc 2585804800 Dec 22 19:59 RTdata.tar
-rw-r--r-- 1 emirvis dtc 229847040 Dec 22 20:00 nwprod.tar
-rw-r--r-- 1 emirvis dtc 187473920 Dec 22 20:03 NEMSlibs-dtrb.tar
drwxr-sr-x 7 emirvis dtc 4096 Dec 22 20:45 .
drwxr-sr-x 10 emirvis dtc 4096 Dec 22 20:48 NEMSlibs-dtrb
drwxr-sr-x 2 emirvis dtc 4096 Dec 22 20:49 LIB
drwxr-sr-x 7 emirvis dtc 4096 Dec 22 20:49 INCMOD
You will need:
• One of the source revision tarballs: NEMS-NCO_ops-2011.tar
• Regression test data: RTdata.tar
• $NWPROD minimal tarball: nwprod.tar
• NEMSlibs distribution tarball: NEMSlibs-dtrb.tar
Untar each of them in a separate directory, go to “NEMSlibs-dtrb”, and run the script “only_3rd_libs_BUILT_LIBS”.
If you don’t have esmf or makedepf90 as modules, build them first using the script GLOBAL_BUILT_LIBS-1.
3rd-party LIBS (all)
[emirvis@fe1 NEMSlibs]$ pwd
/mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data/NEMSlibs
[emirvis@fe1 NEMSlibs]$ ls -trla
total 1300
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:33 bacio
drwxr-sr-x 7 emirvis dtc 4096 Dec 14 18:37 esmf
drwxr-sr-x 4 emirvis dtc 4096 Dec 14 18:39 Linux
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:39 sp
drwxr-sr-x 2 emirvis dtc 12288 Dec 14 18:39 w3_g
drwxr-sr-x 4 emirvis dtc 4096 Dec 14 18:39 nemsio
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:39 lib
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:39 LIB
drwxr-sr-x 7 emirvis dtc 4096 Dec 14 18:39 G_libs
drwxr-sr-x 6 emirvis dtc 4096 Dec 14 18:39 makedepf90-2.8.8
drwxr-sr-x 6 emirvis dtc 4096 Dec 14 18:39 new_nemsio
drwxr-sr-x 6 emirvis dtc 4096 Dec 14 18:40 src
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 18:40 sp_g
-rw-r--r-- 1 emirvis dtc 1200696 Dec 14 18:40 libw3.a
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 18:40 june_gfsio
drwxr-sr-x 6 emirvis dtc 4096 Dec 14 18:40 incmod
drwxr-sr-x 8 emirvis dtc 4096 Dec 14 18:40 INCMOD
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:40 lib_good
drwxr-sr-x 6 emirvis dtc 4096 Dec 14 18:40 incmod_good
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:40 LIB_good
drwxr-sr-x 8 emirvis dtc 4096 Dec 14 18:40 INCMOD_good
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:40 ___w3
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 18:40 ___w3_opn_gblevent
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 18:40 jun_new_nemsio_nemsio+
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:40 jun_new_nemsio_w3+
drwxr-sr-x 29 emirvis dtc 4096 Dec 14 18:40 .
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 18:40 jun_new_nemsio_bacio+
drwxr-sr-x 3 emirvis dtc 4096 Dec 14 20:23 gfsio
drwxr-sr-x 2 emirvis dtc 4096 Dec 14 20:24 sigio
drwxr-sr-x 11 emirvis dtc 4096 Dec 15 18:58 ..
/meso/noscrub :: Original NEMS regression test data
/stmp :: REGRESSION_TEST :: Originating | previous results :: $NEMS_RT_stmp
/ptmp :: REGRESSION_TEST_baselines :: Place to compare :: $NEMS_RT_ptmp
Running RT job scripts
[Diagram: RT.sh (setting computational EnvVars, model switch) → rt_*.sh (rt_nmm.sh, rt_gfs.sh, rt_fim.sh, …) → templates *_ll.IN and *_conf.IN (nmm_ll.IN, nmm_conf.IN, gfs_ll.IN, gfs_conf.IN) → job scripts *_ll.sh (nmm_ll.sh, gfs_ll.sh) → POE / PBS submit]
Submitting a single RT
• Submit the RT job script
• Wait for the job to start
• Follow your run…
• Wait until the run has finished
• Check the results or copy them
FEE Criteria
• DRAFT 2011-09-16 (EM: based on Mark Iredell’s thoughts)
• Procedure for defining functional equivalence on different environments
• Select key fields from the import state, Fi
    • may include a selection of initial cases
• Run control run C of the application on the control environment
• Perturb each Fi in the last significant digit
    • nominally after the 6th decimal digit for 32-bit reals
    • for each value Fi(x), construct a distribution
    • Gaussian distribution with width |Fi(x)/1.e6|
    • randomly select the perturbation from that distribution
• Run perturbed run P from the perturbed import state on the control environment
• Select key fields from the export states Cj (control run) and Pj (perturbed run)
    • may include a selection of output times
• Determine the area weight w(x)
• Compute the butterfly vector
    • the butterfly vector will have the units of the original export fields
• Now run the control run on a different environment D
• Compute the difference vector like the butterfly vector, but using D instead of P
• Compute the dimensionless difference butterfly number vector
    • for a bit-identical requirement, it must be zero
    • for a functionally equivalent requirement, it must be largely less than 10
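The slide's formulas for the butterfly and difference vectors did not survive, so the following is only a sketch under stated assumptions: the Gaussian "width" is read as a standard deviation, each vector component is taken to be an area-weighted RMS difference for one export field, and the butterfly number is taken to be the ratio of the two. All function names are hypothetical.

```python
import math
import random


def perturb(field, rng, rel_width=1.0e-6):
    """Add Gaussian noise with standard deviation |F(x)| * rel_width to each value."""
    return [v + rng.gauss(0.0, abs(v) * rel_width) for v in field]


def weighted_rms_diff(a, b, weights):
    """Area-weighted RMS difference between two fields (assumed norm)."""
    num = sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))
    return math.sqrt(num / sum(weights))


def butterfly_number(control, perturbed, other_env, weights):
    """Cross-environment difference divided by the perturbation-induced difference.

    control   -- export field Cj from the control run
    perturbed -- export field Pj from the perturbed run (same environment)
    other_env -- the same export field from the run on environment D
    """
    butterfly = weighted_rms_diff(perturbed, control, weights)
    difference = weighted_rms_diff(other_env, control, weights)
    return difference / butterfly  # undefined if the perturbation had no effect


# Perturbing a toy import field (values are illustrative only).
rng = random.Random(0)
import_field = [280.0, 285.5, 290.25]
perturbed_import = perturb(import_field, rng)
```

Under these assumptions, a run on D that is bit-identical to C gives a butterfly number of exactly zero, and a run on D that differs from C by as much as the perturbed run does gives 1. The function would be applied once per export field j to build the vector.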
Components of the TESTBED FSE/FEE
• Architecture :: hardware, registers, cache, processors, memory management, OS, data storage and management, support
• Resources :: PEs, memory, cycles, queues, etc.
• Environment :: OS version, system libraries, compilers/versions, flexible module environment, 3rd-party libraries, flexible environment-variable mechanism
• Build system :: scripting, workflows, launchers, parallel submission
• Data :: availability, formats, compatibility, precision
• Code :: documentation, self-documenting, auto-configuration, testing/evaluation and benchmark system
E. Mirvis: Proposed NCAR-EMC cross-repository version control (DRAFT)
[Diagram: DTC and EMC repository locations. DTC: Dev Trunk, Ver X.n branch, stable branch (Ver X.n, Ver X.n+1). EMC: Dev Trunk, Ver Y.m branch, stable branch (Ver Y.m, Ver Y.m+1). Legend: dashed = read or link; solid = full control.]
NEMS regression tests
Based on standard NEMS test development
E. M.: includes some materials from Ratko Vasic’s Apr 2011 presentation for Cirrus
Inspiration:
• Several model cores + several people = accident waiting to happen
• Complicated structure, more components: more room for errors
Solution:
• A series of small tests checking as many combinations as possible (regression tests)
What do we test?
• Bit-identical result reproducibility
• Different domain decomposition
• Threading
• Different physics options
• Timing (only NMMB, should do it for GFS too)
• Different compilations
• And many more…
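Bit-identical reproducibility amounts to a byte-for-byte comparison of a run's output files with stored baseline files. A minimal sketch of such a check follows; it is not the comparison that RT.sh itself performs, and the function names and the flat directory layout are assumptions.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_to_baseline(run_dir, baseline_dir):
    """Return names of baseline files that are missing from or differ in run_dir."""
    problems = []
    for name in sorted(os.listdir(baseline_dir)):
        base = os.path.join(baseline_dir, name)
        if not os.path.isfile(base):
            continue  # only top-level files are compared in this sketch
        new = os.path.join(run_dir, name)
        if not os.path.isfile(new) or file_digest(new) != file_digest(base):
            problems.append(name)
    return problems
```

An empty list means every baseline file was reproduced bit for bit; any name in the list marks a file to inspect.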
Recent changes in the regression test:
• Separation of the tests into “BIG” and “SMALL” (names are not determined yet; we may call them “long”, “quick”, “fast”…)
• GFS submit scripts
• New compilation options
“BIG” and “SMALL” tests:
• The “SMALL” test should be performed before any repository commit
    • It consists of 8 NMMB runs, 7 GFS runs and 1 GEN run
    • Takes about 1.5 hours to finish
• The “BIG” test is run from cron once a week, every Saturday
    • It has 46 tests and 4 different compilations (ESMF3, ESMF5, NMM only, NMM with TRAPS ON; GFS only should be added)
    • Took 7 hours to finish
Regression test options: run RT.sh <arg1> <arg2>
EM: I would not suggest running RT.sh beyond the first test (baseline) on Jet yet, until you are able to replicate my results by running, for instance, on Jet:
> cd /mnt/lfs0/projects/dtc/NEMS/NEMS-15555/job/regression_tests
[emirvis@fe3 regression_tests]$ qsub nmm_ll_em15555
Your job 5282788 ("NEMS_RT15555") has been submitted
• Regular (SMALL) run
    • No arguments
• Create baseline(s)
    • One argument: nmm, gfs, gen or all
• Run “BIG” test
    • Two arguments: “big test”
Other usage:
• Edit RT.sh and run either model/option/test
• EM: again, while I’m working to extend the tests, please edit your own copies of the nmm_ll files
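The argument convention above can be summarized as a small dispatch on the number of arguments. This is only a sketch of the rule as stated on the slide, not the logic inside RT.sh; rt_mode is a hypothetical name, and treating any two arguments as the “BIG” test is an assumption.

```python
BASELINE_CHOICES = ("nmm", "gfs", "gen", "all")


def rt_mode(args):
    """Describe what RT.sh is expected to do for a given argument list."""
    if len(args) == 0:
        return "small"  # regular (SMALL) run
    if len(args) == 1:
        if args[0] not in BASELINE_CHOICES:
            raise ValueError(f"expected one of {BASELINE_CHOICES}, got {args[0]!r}")
        return f"baseline:{args[0]}"  # create baseline(s)
    if len(args) == 2:
        return "big"  # assumption: any two arguments select the BIG test
    raise ValueError("the slide shows at most two arguments")


print(rt_mode([]))
print(rt_mode(["nmm"]))
print(rt_mode(["big", "test"]))
```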
NMMB tests:
1 - Baseline global to compare with previous trunk version: free fcst, pure binary input.
2 - Baseline global with NEMSIO input: free fcst, NEMSIO input.
3 - Global restart run: restart, pure binary input.
4 - Global restart run with NEMSIO input: restart, NEMSIO input.
5 - Global with different domain decomposition: 3x5 compute tasks with single thread, opnl physics, free fcst, pure binary input.
6 - Global with multiple threading: 6x5 compute tasks with 2 threads, opnl physics, free fcst, pure binary input.
7 - Global with GFS physics: GFS physics, free fcst, pure binary input.
8 - Baseline regional to compare with previous trunk version: free fcst, pure binary input.
9 - Baseline regional with NEMSIO input: free fcst, NEMSIO input.
10 - Regional restart run: restart, pure binary input.
11 - Regional restart run with NEMSIO input: restart, NEMSIO input.
12 - Regional with different domain decomposition: 3x5 compute tasks with single thread, opnl physics, free fcst, pure binary input.
13 - Regional with multiple threading: 6x5 compute tasks with 2 threads, opnl physics, free fcst, pure binary input.
14 - Regional with GFS physics: GFS physics, free fcst, pure binary input.
15 - Regional with nesting: regional parent with two children and one grandchild, single thread, opnl physics, free fcst, pure binary input.
16 - Regional restart with nesting: regional parent with two children and one grandchild, single thread, opnl physics, free fcst, pure binary input.
17 - Regional with precipitation adjustment on: free fcst, pure binary input.
18 - Regional writing time series: free fcst, pure binary input.
19 - Regional with nesting: test digital filter.
How you can run NMMB @ DTC for now:
What you need in your job/regression_tests dir in order to run:
• ts_locations.nml
• phy_state.txt
• dyn_state.txt
• nmm_ll* (script)
• ../exe/NEMS.x
RESULTS :: My 48-hour run results on JET
[emirvis@fe1 ptmp]$ ls
48_hst_comp  RT_111-0  RT_15555  RT_203dtest_du1
RT_05  RT_111-1  RT_15555+  RT_444
RT_06  RT_111-2  RT_15555_o  RT_555
RT_07  RT_111-3  RT_201  RT_777
RT_08  RT_111-4  RT_202  RT_NEMS-NCO_ops-2011
RT_09  RT_112  RT_203-1  test
RT_11  RT_12  RT_203-1.1-good1
RT_111  RT_14  RT_203-16149-full
RT_111+1  RT_14601nco2.11.3  RT_203dtest
Located at:
[emirvis@fe1 ptmp]$ pwd
/mnt/lfs0/projects/dtc/NEMS/NEMS_libs_data/ptmp
My 48-hour run results on “Bluefire”:
/glade/proj2/ral/RJNTB/NEMS/NEMS_libs_data/ptmp/RT_111/NMM_CNTRL