HTCondor at the RAL Tier-1

Presentation Transcript

  1. HTCondor at the RAL Tier-1 Andrew Lahiff

  2. Overview • Current status at RAL • Multi-core jobs • Interesting features we’re using • Some interesting features & commands • Future plans

  3. Current status at RAL

  4. Background • Computing resources • 784 worker nodes, over 14K cores • Generally have 40-60K jobs submitted per day • Torque / Maui had been used for many years • Many issues • Severity & number of problems increased as size of farm increased • Doesn’t like dynamic resources • A problem if we want to extend batch system into the cloud • In 2012 decided it was time to start investigating moving to a new batch system

  5. Choosing a new batch system • Considered, tested & eventually rejected the following • LSF, Univa Grid Engine* • Requirement: avoid commercial products unless absolutely necessary • Open source Grid Engines • Competing products, not sure which has a long-term future • Communities appear less active than HTCondor & SLURM • Existing Tier-1s running Grid Engine use the commercial version • Torque 4 / Maui • Maui problematic • Torque 4 seems less scalable than alternatives (but better than Torque 2) • SLURM • Carried out extensive testing & comparison with HTCondor • Found that for our use case it was very fragile & easy to break; unable to get it working reliably above 6000 running jobs * Only tested open source Grid Engine, not Univa Grid Engine

  6. Choosing a new batch system • HTCondor chosen as replacement for Torque/Maui • Has the features we require • Seems very stable • Easily able to run 16,000 simultaneous jobs • Didn’t do any tuning – it “just worked” • Have since tested > 30,000 running jobs • More customizable than the other batch systems we considered

  7. The story so far • History of HTCondor at RAL • Jun 2013: started testing with real ATLAS & CMS jobs • Sep 2013: 50% pledged resources moved to HTCondor • Nov 2013: fully migrated to HTCondor • Experience • Very stable operation • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores • Job start rate much higher than Torque / Maui even when throttled • Very good support

  8. Architecture [Diagram] • CEs: condor_schedd • Central manager: condor_negotiator, condor_collector • Worker nodes: condor_startd

  9. Our setup • 2 central managers • condor_master • condor_collector • condor_HAD (responsible for high availability) • condor_replication (responsible for high availability) • condor_negotiator (only running on at most 1 machine at a time) • 8 submission hosts (4 ARC CE, 2 CREAM CE, 2 UI) • condor_master • condor_schedd • Lots of worker nodes • condor_master • condor_startd • Monitoring box (runs 8.1.x, which contains Ganglia integration) • condor_master • condor_gangliad

  10. Computing elements • ARC experience so far • Have run over 9.4 million jobs so far across our ARC CEs • Generally ignore them & they “just work” • VO status • ATLAS & CMS • Fine from the beginning • LHCb • The ability to submit to ARC was added to DIRAC • ALICE • Not yet able to submit to ARC; have said they will work on this • Non-LHC VOs • Some use DIRAC, which can now submit to ARC • Some use EMI WMS, which can submit to ARC

  11. Computing elements • ARC 4.1.0 released recently • Will be in UMD very soon • Has just passed through staged-rollout • Contains all of our fixes to HTCondor backend scripts • Plans • When VOs start using RFC proxies we could enable the web service interface • Doesn’t affect ATLAS/CMS • VOs using NorduGrid client commands (e.g. LHCb) can get job status information more quickly

  12. Computing elements • Alternative: HTCondor-CE • Special configuration of HTCondor, not a brand new service • Some sites in the US are starting to use this • Note: contains no BDII (!) [Diagram: ATLAS & CMS pilot factories submit via HTCondor-G to the site’s HTCondor-CE (schedd + job router), which feeds the site’s central manager(s) (collector(s), negotiator), schedd & worker node startds]

  13. Multi-core jobs

  14. Getting multi-core jobs to work • Job submission • Haven’t set up dedicated queues • VO has to request how many cores they want in their JDL (see the sketch below) • Fine for ATLAS & CMS, not sure yet for LHCb/DIRAC… • Could set up additional queues if necessary • Did 5 things on the HTCondor side to support multi-core jobs…
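As an illustration, a minimal sketch of how such a request looks once it reaches HTCondor as a submit description (the file name, executable & memory value are hypothetical):

# multicore.sub -- hypothetical submit description for an 8-core job
universe       = vanilla
executable     = run_payload.sh
request_cpus   = 8        # the only multi-core-specific setting
request_memory = 16000    # MB, scaled up for 8 cores
queue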

  15. Getting multi-core jobs to work • Worker nodes configured to use partitionable slots • WN resources divided up as necessary amongst jobs • We had this configuration anyway • Set up multi-core accounting groups & associated quotas • Configured so that multi-core jobs automatically get assigned to the appropriate groups • Specified group quotas (fairshares) for the multi-core groups • Adjusted the order in which the negotiator considers groups • Consider multi-core groups before single-core groups • 8 free slots are “expensive” to obtain, so try not to lose them too quickly (see the sketch after this list)
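A hedged sketch of the negotiator-side configuration described above (the knobs are real; group names & quota values are illustrative, not our production settings):

# Accounting groups & dynamic quotas (fairshares), incl. multi-core groups
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas           = 0.4
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.1
GROUP_QUOTA_DYNAMIC_group_cms             = 0.4
GROUP_QUOTA_DYNAMIC_group_cms_multicore   = 0.1
# Consider multi-core groups before single-core groups, so blocks of
# 8 free cores aren't immediately given away to single-core jobs
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                  ifThenElse(regexp("multicore", AccountingGroup), 1.0, 2.0))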

  16. Getting multi-core jobs to work • Set up the condor_defrag daemon (config sketch below) • Finds WNs to drain, triggers draining & cancels draining as required • Picks WNs to drain based on how many cores they have that can be freed up • E.g. getting 8 free cores by draining a full 32-core WN is generally faster than draining a full 8-core WN
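A sketch of the condor_defrag configuration involved (the knobs are real; the values shown are illustrative, not our production settings):

# Run the defrag daemon on the central manager
DAEMON_LIST = $(DAEMON_LIST) DEFRAG
# Evaluate every 10 minutes; cap the amount of simultaneous draining
DEFRAG_INTERVAL = 600
DEFRAG_MAX_CONCURRENT_DRAINING = 4
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
# A machine counts as "whole" once 8 cores are free
DEFRAG_WHOLE_MACHINE_EXPR = PartitionableSlot && Cpus >= 8
# Prefer draining WNs expected to waste the least CPU time
DEFRAG_RANK = -ExpectedMachineGracefulDrainingBadput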

  17. Getting multi-core jobs to work • Improvement to the condor_defrag daemon • Demand for multi-core jobs not known by condor_defrag • Set up a simple cron which adjusts the number of concurrently draining WNs based on demand (sketch below) • If many idle multi-core jobs but few running, drain aggressively • Otherwise very little draining
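A hedged bash sketch of such a cron (thresholds & the central manager hostname are hypothetical; -rset requires the CONFIG security permission):

#!/bin/bash
# Count idle & running multi-core jobs across all schedds
IDLE=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 1' -autoformat ClusterId | wc -l)
RUNNING=$(condor_q -global -constraint 'RequestCpus > 1 && JobStatus == 2' -autoformat ClusterId | wc -l)

# Many idle multi-core jobs but few running: drain aggressively;
# otherwise keep draining to a minimum
if [ "$IDLE" -gt 50 ] && [ "$RUNNING" -lt 10 ]; then
    MAX=8
else
    MAX=1
fi

# Push the new limit to the defrag daemon & trigger a reconfig
condor_config_val -name condor-cm.example.com -rset "DEFRAG_MAX_CONCURRENT_DRAINING = $MAX"
condor_reconfig -name condor-cm.example.com -daemon defrag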

  18. Results [Plots: running & idle multi-core jobs – gaps in submission by ATLAS result in a loss of multi-core slots; number of WNs running multi-core jobs & number of draining WNs] • Significantly reduced CPU wastage due to the cron • Aggressive draining: 3% waste • Less-aggressive draining: < 1% waste

  19. Multi-core jobs • Current status • Haven’t made any changes over the past few months • Now only CMS running multi-core jobs • Waiting for ATLAS to start up again • Will be interesting to see what happens when multiple VOs run multi-core jobs • May look at making improvements if necessary • Details about our configuration here: https://www.gridpp.ac.uk/wiki/RAL_HTCondor_Multicore_Jobs_Configuration

  20. Interesting features we’re using (or are about to use)

  21. Startd cron • Worker node health-check script • Script run on the WN at regular intervals by HTCondor • Can place custom information into WN ClassAds. In our case: • NODE_IS_HEALTHY (should the WN start jobs or not) • NODE_STATUS (list of any problems)

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTHCHECK
STARTD_CRON_WN_HEALTHCHECK_EXECUTABLE = /usr/local/bin/healthcheck_wn_condor
STARTD_CRON_WN_HEALTHCHECK_PERIOD = 10m
STARTD_CRON_WN_HEALTHCHECK_MODE = periodic
STARTD_CRON_WN_HEALTHCHECK_RECONFIG = false
STARTD_CRON_WN_HEALTHCHECK_KILL = true
## When is this node willing to run jobs?
START = (NODE_IS_HEALTHY =?= True)
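The health-check script itself simply prints ClassAd attributes on stdout. A minimal sketch, assuming a CVMFS-only check (our production script tests more than this):

#!/bin/bash
# /usr/local/bin/healthcheck_wn_condor -- illustrative sketch
PROBLEMS=""

# Check that the required CVMFS repositories respond
for repo in atlas.cern.ch cms.cern.ch lhcb.cern.ch; do
    if ! timeout 60 ls /cvmfs/$repo > /dev/null 2>&1; then
        PROBLEMS="$PROBLEMS Problem: CVMFS for $repo"
    fi
done

# Each line of output becomes an attribute in the startd ClassAd
if [ -z "$PROBLEMS" ]; then
    echo 'NODE_IS_HEALTHY = True'
    echo 'NODE_STATUS = "All_OK"'
else
    echo 'NODE_IS_HEALTHY = False'
    echo "NODE_STATUS = \"$PROBLEMS\""
fi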

  22. Startd cron • Current checks • CVMFS • Filesystem problems (e.g. read-only) • Swap usage • Plans • May add more checks, e.g. CPU load • More thorough tests only run when HTCondor first starts up? • E.g. checks for grid middleware, checks for other essential software & configuration, … • Want to be sure that jobs will never be started unless the WN is correctly set up • Will be more important for dynamic virtual WNs

  23. Startd cron • Easily list any WNs with problems, e.g.

# condor_status -constraint 'PartitionableSlot == True && NODE_STATUS != "All_OK"' -autoformat Machine NODE_STATUS
lcg1209.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1248.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1309.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch
lcg1340.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

• gmetric script using the HTCondor Python API for making Ganglia plots

  24. PID namespaces • We have USE_PID_NAMESPACES = True on WNs • Jobs can’t see any system processes or processes associated with other jobs on the WN • Example stdout of a job running “ps -e”:

  PID TTY          TIME CMD
    1 ?        00:00:00 condor_exec.exe
    3 ?        00:00:00 ps

  25. MOUNT_UNDER_SCRATCH • Each job sees a different /tmp, /var/tmp • Uses bind mounts to directories inside job scratch area • No more junk left behind in /tmp • Jobs can’t fill /tmp & cause problems • Jobs can’t see what other jobs have written into /tmp • For glexec jobs to work, need a special lcmaps plugin enabled • lcmaps-plugins-mount-under-scratch • Minor tweak to lcmaps.db • We have tested this but it’s not yet rolled out
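For reference, the knob itself is a comma-separated list of directories (syntax per the HTCondor manual; the paths follow the slide):

# Each job gets private bind-mounted copies of these directories,
# backed by its scratch area
MOUNT_UNDER_SCRATCH = /tmp,/var/tmp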

  26. CPU affinity • Can set ASSIGN_CPU_AFFINITY = True on WNs • Jobs locked to specific cores • Problem: • When PID namespaces are also used, CPU affinity doesn’t work

  27. Control groups • Cgroups: mechanism for managing a set of processes • We’re starting with the most basic option: CGROUP_MEMORY_LIMIT_POLICY=none • No memory limits applied • Cgroups used for • Process tracking • Memory accounting • CPU usage assigned proportionally to the number of CPUs in the slot • Currently configured on 2 WNs for initial testing with production jobs
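A sketch of the corresponding WN configuration (both knobs are real; the cgroup name is a site choice):

# Track job processes under a dedicated cgroup hierarchy
BASE_CGROUP = htcondor
# Apply no memory limits for now; cgroups still provide process
# tracking, memory accounting & proportional CPU shares
CGROUP_MEMORY_LIMIT_POLICY = none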

  28. Some interesting features & commands

  29. Upgrades • Central managers, CEs: • Update the RPMs • condor_master will notice the binaries have changed & restart daemons as required • Worker nodes • Make sure WN config contains MASTER_NEW_BINARY_RESTART=PEACEFUL • Update the RPMs • condor_master will notice the binaries have changed, drain running jobs, then restart daemons as required
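A minimal sketch of the WN procedure under these assumptions (RPM-based install, config already containing the peaceful-restart knob):

# Verify the knob is in place, then update
condor_config_val MASTER_NEW_BINARY_RESTART    # should print: PEACEFUL
yum update condor
# condor_master notices the new binaries, lets running jobs finish,
# then restarts the daemons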

  30. Dealing with held jobs • Held jobs have failed in some way & remain in the queue waiting for user intervention • E.g. input file(s) missing from the CE • Can configure HTCondor to deal with them automatically • Try re-running held jobs once only, after waiting 30 minutes

SYSTEM_PERIODIC_RELEASE = ((CurrentTime - EnteredCurrentStatus > 30 * 60) && (JobRunCount < 2))

• Remove held jobs after 24 hours

SYSTEM_PERIODIC_REMOVE = ((CurrentTime - EnteredCurrentStatus > 24 * 60 * 60) && (JobStatus == 5))

  31. condor_who • See what jobs are running on a worker node

[root@lcg0975 ~]# condor_who
OWNER                    CLIENT                   SLOT JOB       RUNTIME    PID   PROGRAM
patls053@gridpp.rl.ac.uk arc-ce03.gridpp.rl.ac.uk 1_5  2730014.0 0+03:30:12 30184 /pool/condor/dir_30180/condor_exec.exe
patls053@gridpp.rl.ac.uk arc-ce01.gridpp.rl.ac.uk 1_7  3189534.0 0+03:38:19 21266 /pool/condor/dir_21262/condor_exec.exe
tatls011@gridpp.rl.ac.uk arc-ce03.gridpp.rl.ac.uk 1_6  2729613.0 0+04:40:21 6942  /pool/condor/dir_6938/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk arc-ce02.gridpp.rl.ac.uk 1_4  2977866.0 0+08:42:13 26669 /pool/condor/dir_26665/condor_exec.exe
tatls001@gridpp.rl.ac.uk arc-ce01.gridpp.rl.ac.uk 1_1  3186150.0 0+10:57:05 12401 /pool/condor/dir_12342/condor_exec.exe
tlhcb005@gridpp.rl.ac.uk arc-ce01.gridpp.rl.ac.uk 1_3  3174829.0 0+16:03:57 26418 /pool/condor/dir_26331/condor_exec.exe
pcms054@gridpp.rl.ac.uk  arc-ce01.gridpp.rl.ac.uk 1_2  3149655.0 1+23:55:51 31281 /pool/condor/dir_31268/condor_exec.exe

  32. condor_q -analyze 1 • Why isn’t a job running?

-bash-4.1$ condor_q -analyze 16244
-- Submitter: lcgui03.gridpp.rl.ac.uk : <130.246.180.41:45033> : lcgui03.gridpp.rl.ac.uk
User priority for alahiff@gridpp.rl.ac.uk is not available, attempting to analyze without it.
---
16244.000:  Run analysis summary.  Of 13180 machines,
  13180 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
WARNING:  Be advised:   No resources matched request's constraints

  33. condor_q -analyze 2

The Requirements expression for your job is:
( ( Ceph is true ) ) && ( TARGET.Arch == "X86_64" ) &&
( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
( TARGET.Memory >= RequestMemory ) && ( TARGET.Cpus >= RequestCpus ) &&
( TARGET.HasFileTransfer )

Suggestions:
    Condition                     Machines Matched  Suggestion
    ---------                     ----------------  ----------
1   ( TARGET.Cpus >= 16 )         0                 MODIFY TO 8
2   ( ( target.Ceph is true ) )   139
3   ( TARGET.Memory >= 1 )        13150
4   ( TARGET.Arch == "X86_64" )   13180
5   ( TARGET.OpSys == "LINUX" )   13180
6   ( TARGET.Disk >= 1 )          13180
7   ( TARGET.HasFileTransfer )    13180

  34. condor_ssh_to_job • ssh into a job from a CE

[root@arc-ce01 ~]# condor_ssh_to_job 3147487.0
Welcome to slot1@lcg1554.gridpp.rl.ac.uk!
Your condor job is running with pid(s) 2402.
[pcms054@lcg1554 dir_2393]$ hostname
lcg1554.gridpp.rl.ac.uk
[pcms054@lcg1554 dir_2393]$ ls
condor_exec.exe
_condor_stderr
_condor_stderr.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
_condor_stdout
_condor_stdout.schedd_glideins4_vocms32.cern.ch_1497953.4_1400484359
glide_adXvcQ
glidein_startup.sh
job.MCJMDmk8W6jnCIXDjqiBL5XqABFKDmABFKDmb5JKDmEBFKDmQ8FM6m.proxy

  35. condor_fetchlog • Retrieve log files from daemons on other machines

[root@condor01 ~]# condor_fetchlog arc-ce04.gridpp.rl.ac.uk SCHEDD
05/20/14 12:39:09 (pid:2388) ******************************************************
05/20/14 12:39:09 (pid:2388) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
05/20/14 12:39:09 (pid:2388) ** /usr/sbin/condor_schedd
05/20/14 12:39:09 (pid:2388) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
05/20/14 12:39:09 (pid:2388) ** Configuration: subsystem:SCHEDD local:<NONE> class:DAEMON
05/20/14 12:39:09 (pid:2388) ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/20/14 12:39:09 (pid:2388) ** $CondorPlatform: x86_RedHat6 $
05/20/14 12:39:09 (pid:2388) ** PID = 2388
05/20/14 12:39:09 (pid:2388) ** Log last touched time unavailable (No such file or directory)
05/20/14 12:39:09 (pid:2388) ******************************************************
...

  36. condor_gather_info • Gathers information about a job • Including log files from the schedd & startd

[root@arc-ce01 ~]# condor_gather_info --jobid 3288142.0
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog.old
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/ShadowLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StartLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/MasterLog.lcg1043.gridpp.rl.ac.uk
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/condor-profile.txt
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_q
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_userlog_lines
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad_analysis
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/3288142.0/job_ad
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/SchedLog.old
cgi-root-jid3288142.0-2014-05-31-09_36_59PM-BST/StarterLog.slot1

  37. Job ClassAds • Lots of useful information in job ClassAds • Including the email address from the proxy • Easy to contact users of problematic jobs

# condor_q 3279852.0 -autoformat x509UserProxyEmail
andrew.lahiff@stfc.ac.uk

  38. condor_chirp • Jobs can put custom information into job ClassAds • Example: lcmaps-plugin-condor-update • Puts information into job ClassAd about glexec payload user & DN • Can then use condor_q to see this information
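A minimal illustration of the mechanism, run from inside a job (the attribute name is hypothetical):

# Add a custom attribute to this job's own ClassAd
condor_chirp set_job_attr PayloadUser '"cms_pilot"'
# Then, from the submit host: condor_q <job-id> -autoformat PayloadUser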

  39. Job router • Job router daemon transforms jobs from one type to another according to configurable policies • E.g. submit jobs to a different batch system or a CE • Example: sending excess jobs to the GridPP Cloud using glideinWMS

JOB_ROUTER_DEFAULTS = \
  [ \
    MaxIdleJobs = 10; \
    MaxJobs = 50; \
  ]
JOB_ROUTER_ENTRIES = \
  [ \
    Requirements = true; \
    GridResource = "condor lcggwms02.gridpp.rl.ac.uk lcggwms02.gridpp.rl.ac.uk:9618"; \
    name = "GridPP_Cloud"; \
  ]

  40. Job router • Example: initially have 5 idle jobs

-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED  RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended

  41. Job router • Routed copies of the jobs soon appear • Original job mirrors the status of the routed copy

-bash-4.1$ condor_q
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    SUBMITTED  RUN_TIME   ST PRI SIZE CMD
2249.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.1   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.2   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.3   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2249.4   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2250.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2251.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2252.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2253.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
2254.0   alahiff  6/2 20:57  0+00:00:00 I  0   0.0  (CMSAnalysis )
10 jobs; 0 completed, 0 removed, 10 idle, 0 running, 0 held, 0 suspended

  42. Job router • Can check that the new jobs have been sent to a remote resource

[root@lcgvm21 ~]# condor_q -grid
-- Submitter: lcgvm21.gridpp.rl.ac.uk : <130.246.181.102:40557> : lcgvm21.gridpp.rl.ac.uk
 ID      OWNER    STATUS  GRID->MANAGER    HOST                GRID_JOB_ID
2250.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl          20.0
2251.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl          21.0
2252.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl          17.0
2253.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl          18.0
2254.0   alahiff  IDLE    condor->lcggwms02.gridpp.rl          19.0

(GRID_JOB_ID column: job ids on the remote resource, the glideinWMS HTCondor pool)

  43. Future plans • Setting up ARC CE & some WNs using Ceph as a shared storage system (CephFS) • ATLAS testing with arcControlTower • Pulls jobs from PanDA, pushes jobs to ARC CEs • Input files pre-staged & cached on Ceph by ARC • Currently in progress…

  44. Future plans • Test power management features • HTCondor can power down idle WNs and wake them as required • Batch system expanding into the cloud • Make use of idle private cloud resources • We have tested that condor_rooster can be used to dynamically provision VMs as they are needed
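A hedged sketch of the condor_rooster side of this (the knobs are real; the wake-up script is a hypothetical cloud-provisioning hook replacing the default wake-on-LAN command):

# Run the rooster daemon on the central manager
DAEMON_LIST = $(DAEMON_LIST) ROOSTER
# Which offline WN ads are eligible to be "woken"
ROOSTER_UNHIBERNATE = Offline =?= True
# Provision a VM instead of sending a wake-on-LAN packet
ROOSTER_WAKEUP_CMD = "/usr/local/bin/provision_vm.sh"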

  45. Questions?

  46. Backup slides

  47. Monitoring

  48. Overview • Batch system monitoring • Mimic • Jobview • Ganglia • Elasticsearch

  49. Mimic • Overview of state of worker nodes

  50. Jobview http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html
