
Report from USA


Presentation Transcript


  1. Report from USA
  Massimo Sgaravatto, INFN Padova

  2. Introduction
  • Workload management system for productions
    • Monte Carlo productions, data reconstructions and production analyses
    • Scheduled activities
  • Goal: optimization of overall throughput

  3. Possible architecture
  [Architecture diagram: a Master performs resource discovery through the GIS, which publishes information on the characteristics and status of the local resources. Jobs are submitted to a Personal Condor (condor_submit with the Globus Universe; condor_q, condor_rm, …), which uses globusrun to reach the GRAM services in front of the local resource managers (Condor, LSF, PBS) at Site1, Site2 and Site3.]

  4. Overview
  • GRAM as a uniform interface to different local resource management systems
  • Personal Condor able to provide robustness and reliability
    • The user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
  • Usage of the Condor interface and tools to “manage” the jobs (a submit file sketch follows this list)
    • “Robust” tools with all the required capabilities (monitoring, logging, …)
  • Master smart enough to decide to which Globus resources the jobs must be submitted
    • The Master uses the information on characteristics and status of resources published in the GIS
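As an illustration of the “Condor interface” point, here is a minimal Condor-G submit description sketch, assuming the Globus-universe syntax of the Condor 6.x era; the gatekeeper contact is the one used elsewhere in these slides, while the file names are hypothetical:

    # Hypothetical Condor-G submit file (Globus universe, Condor 6.x era)
    universe        = globus
    globusscheduler = lxpd.pd.infn.it/jobmanager-lsf
    executable      = startcmsim.sh
    output          = cmsim.out
    error           = cmsim.err
    log             = cmsim.log
    queue

Submitting this with condor_submit is what the architecture slide labels “Submit jobs (Globus Universe)”.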

  5. Globus GRAM
  • Fixed problems:
    • I/O with vanilla Condor jobs
    • globus-job-status with LSF and Condor
    • Publishing of Globus LSF and Condor jobs in the GIS
  • Open problems:
    • Submission of multiple instances of the same job to an LSF cluster
      • Necessary to modify the Globus LSF scripts
    • Scalability
    • Fault tolerance

  6. Globus GRAM Architecture
  [Architecture diagram: a client runs globusrun against the Globus front-end machine, where a jobmanager hands the job to the local resource manager (LSF, Condor, PBS, …).]

    % globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

    file.rsl:
    & (executable=$(CMS)/startcmsim.sh)
      (stdin=$(CMS)/Pythia/inp)
      (stdout=$(CMS)/Cmsim/out)
      (count=1)
      (queue=cmsprod)
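With -b globusrun submits in batch mode and returns a job contact, which can then be polled with globus-job-status (the command mentioned on the previous slide); a sketch, with a hypothetical contact string:

    % globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl
    https://lxpd.pd.infn.it:39123/12345/98765/    (job contact; value hypothetical)
    % globus-job-status https://lxpd.pd.infn.it:39123/12345/98765/
    ACTIVE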

  7. Scalability
  • One jobmanager for each globusrun
  • If I want to submit 1000 jobs?
    • 1000 globusrun invocations
    • 1000 jobmanagers running on the front-end machine!
  • Submitting them as a single request does not help:

    % globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f file.rsl

    file.rsl:
    & (executable=$(CMS)/startcmsim.sh)
      (stdin=$(CMS)/Pythia/inp)
      (stdout=$(CMS)/Cmsim/out)
      (count=1000)
      (queue=cmsprod)

    • Problems with LSF
    • It is not possible to specify 1000 different input files and 1000 different output files in the RSL file …
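Since a single RSL cannot vary stdin/stdout per job, distinct per-job files force one RSL file and one globusrun (hence one jobmanager) per job; a sketch of what that means in practice, with hypothetical per-job file names:

    # Hypothetical workaround: one globusrun per job -> 1000 jobmanagers
    for i in $(seq 1 1000); do
      {
        echo "& (executable=\$(CMS)/startcmsim.sh)"
        echo "  (stdin=\$(CMS)/Pythia/inp.$i)"
        echo "  (stdout=\$(CMS)/Cmsim/out.$i)"
        echo "  (count=1)"
        echo "  (queue=cmsprod)"
      } > job$i.rsl
      globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf -f job$i.rsl
    done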

  8. Fault tolerance
  • The jobmanager is not persistent
  • If the jobmanager can’t be contacted, Globus assumes that the job(s) have been completed
  • Example:
    • Submission of n jobs to an LSF cluster
    • Reboot of the front-end machine
    • The jobmanager(s) no longer run
    • Orphan jobs -> Globus assumes that the jobs have been successfully completed
  • Globus is not able to tell whether a job exited normally, or whether it no longer runs because of a problem (e.g. a crash of the executing machine) and must therefore be re-submitted

  9. Globus Universe
  • Condor-G tested with:
    • Workstation using the fork system call
    • LSF cluster
    • Condor pool
  • Submission (condor_submit), monitoring (condor_q) and removal (condor_rm) seem to work fine, but… (see the example below)
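For reference, the everyday Condor commands apply unchanged to Globus-universe jobs; a sketch, with a hypothetical submit file name and job IDs:

    % condor_submit cmsim.sub   # submit the Globus-universe job(s)
    % condor_q                  # monitor the queue
    % condor_rm 42.0            # remove job 0 of cluster 42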

  10. Globus Universe: problems
  • It is not possible to have the input/output/error files on the submitting machine
  • Very difficult to understand what went wrong when errors occur
  • Condor-G is not able to provide fault tolerance and robustness (because Globus doesn’t provide these features)
    • Fault tolerance only on the submitting side

  11. Condor-G Architecture
  [Architecture diagram: condor_submit hands jobs to a Personal Condor (GlobusClient), which runs globusrun toward the Globus front-end machine and polls the jobs’ status (globus_job_status); on the front end, a jobmanager runs the jobs through the local resource manager (LSF, Condor, PBS, …).]

  12. Possible solutions
  • Some improvements foreseen with Condor 6.3 (but they will not solve all the problems)
  • Persistent Globus jobmanager
    • ???
  • Direct interaction between Condor and the local resource management systems (LSF)
    • Necessary to modify the Condor startd
  • GlideIn
    • The only “ready-to-use” solution if robustness is considered a fundamental requirement

  13. GlideIn
  • Condor daemons run on Globus resources
    • Local resource management systems used only to run the Condor daemons
  • Robustness and fault tolerance
  • Use of the Condor matchmaking system
    • Viable solution if the goal is just to find idle CPUs
    • And if we have to take other parameters into account (e.g. location of input files)? (see the sketch after this list)
  • Various changes have been necessary in the condor_glidein script
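A sketch of how glide-in plus matchmaking could take such a parameter into account. The condor_glidein flags follow the Condor 6.x command but should be treated as an assumption, and InputFilesOnSite is a hypothetical machine ClassAd attribute that the glided-in startds would have to advertise:

    # Start 10 glided-in Condor daemons through the LSF gatekeeper
    # (flag names per Condor 6.x condor_glidein; treat as an assumption)
    % condor_glidein -count 10 lxpd.pd.infn.it/jobmanager-lsf

    # Submit file fragment: match only machines that already hold the input
    # (InputFilesOnSite is hypothetical, not a standard ClassAd attribute)
    requirements = (InputFilesOnSite =?= TRUE)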

  14. GlideIn
  • GlideIn tested with:
    • Workstation using the fork system call as job manager
      • Seems to work
    • Condor pool
      • Seems to work
      • Condor flocking is the better solution if authentication is not required (a configuration sketch follows this list)
    • LSF cluster
      • Problems (because Globus assumes that LSF manages SMP machines, while there are some problems with clusters)
      • Necessary to modify the Globus LSF scripts
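For comparison, flocking needs only a pair of condor_config entries rather than glided-in daemons; a minimal sketch with hypothetical host names:

    # condor_config on the submitting pool
    FLOCK_TO   = central-manager.other-pool.example
    # condor_config on the remote pool, admitting the flocked schedd
    FLOCK_FROM = submit-host.pd.infn.it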

  15. Conclusions
  • Major problems with scalability and fault tolerance in Globus
    • Necessary to re-implement the GRAM service
  • The foreseen architecture doesn’t work
    • Personal Condor is able to provide robustness only on the submitting side
