
CREAM Status and plans


Presentation Transcript


  1. CREAM Status and plans
     Massimo Sgaravatto, INFN Padova
     On behalf of the CREAM product team

  2. Status
     • CREAM CE v. 1.5 now in production
       • In gLite 3.1 (sl4_i386)
       • In gLite 3.2 (sl5_x86_64)
     • CREAM CE v. 1.6 for gLite 3.2 / sl5_x86_64 certified 2 weeks ago
       • Certified according to the new certification model
     • While trying it at UCSD last week (in the context of the OSG CREAM evaluation), bug #64516 in trustmanager was found
       • Issue not in CREAM, but affecting (also) CREAM
       • It affects certificates signed by the DoE CA
       • → CREAM needs to be rebuilt against a fixed version of trustmanager, when ready
     GDB - Amsterdam, March 24, 2010

  3. New certification model
     • Each middleware product team is responsible for certifying its software
     • Two types of patches
       • Metapackage patch: a patch corresponding to a Grid node (e.g. CREAM CE, WMS node, LB server, …)
       • Internal patch: software to be included in one or more metapackage patches (e.g. trustmanager, lcas, lcmaps, voms, etc.)
       • When a metapackage patch is certified, the relevant internal patches certified so far are also “included”
     • Old-style patches not accepted anymore by the gLite release team since December 9, 2009
     • The transition to the new certification model is taking much longer than originally hoped
       • Certification of patches wrt the new model possible for gLite 3.2 / sl5_x86_64 since ~ mid-February
       • Still not technically possible to certify patches for gLite 3.1 / sl4_i386
         • Not only for CREAM, but for gLite in general

  4. CREAM CE 1.6
     • New operation QueryEvent
       • To be used by WMS (ICE)
       • To improve ICE’s detection of job status changes
       • Scalability problems in current (CEMon + polling) approach
     • Glexec → sudo
       • Glexec used only once per job submission (just to find the uid to be used in sudo calls)
       • Better performance
       • Will facilitate the integration with Argus
     • New BLAH BLparser for LSF and Torque/PBS
       • Using the status/history commands instead of parsing the log files
         • Log files might still be needed by the batch system commands (e.g. tracejob, bhist)
       • Also allows easier configuration
         • No longer needed to configure the CE and then the blparser
       • Old parser still supported
         • At configuration time the BLparser type (new/old) can be chosen

  5. CREAM CE 1.6 (cont.)
     • Support of SGE in BLAH (as contributed by CESGA/LIP)
     • Limiter to protect CREAM when the machine is overloaded
       • New job submissions are blocked when this happens
       • Taking into account load, memory usage, # of file descriptors, etc.
       • Very similar to the limiter used in the WMS
     • Proxy purger
       • To clean expired delegations from the delegationDB and from the file system
     • Support of file transfers from/to gridftp servers started with user credentials
       • Asked by Condor
     • Improved CREAM startup
       • It could take a while if needed to get the status of jobs in a non-terminal status
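
The limiter idea on this slide can be sketched as a simple threshold check on machine metrics. This is only an illustration, not CREAM's actual implementation: the class names, metric names and threshold values below are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Limits:
    # Hypothetical thresholds, chosen for illustration (not CREAM defaults).
    max_load1: float = 40.0
    min_free_mem_mb: int = 512
    max_open_fds: int = 4000


@dataclass
class Metrics:
    load1: float       # 1-minute load average
    free_mem_mb: int   # free memory in MB
    open_fds: int      # number of open file descriptors


def violated_limits(m: Metrics, lim: Limits) -> List[str]:
    """Return the names of the limits currently exceeded (empty if none)."""
    reasons = []
    if m.load1 > lim.max_load1:
        reasons.append("load")
    if m.free_mem_mb < lim.min_free_mem_mb:
        reasons.append("memory")
    if m.open_fds > lim.max_open_fds:
        reasons.append("file descriptors")
    return reasons


def accept_submission(m: Metrics, lim: Limits) -> bool:
    """New job submissions are accepted only when no limit is exceeded."""
    return not violated_limits(m, lim)
```

In this sketch only new submissions are gated; nothing is done to jobs already accepted, which matches the "new job submissions are blocked" wording of the slide.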

  6. CREAM CE 1.6 (cont.)
     • Improved proxy renewals when there are multiple jobs sharing the same delegationid
       • Typical use case, considering at least the submissions from WMS and from Condor
     • Improved performance of some DB queries
     • Several bug fixes
       • User ‘tomcat’ not added anymore to VO groups and glexec group
       • Glexec’s lcmaps conf file fixed
         • It could happen that you got mapped to a user different from the one mapped by gridftpd
       • GLITE_WMS_RB_BROKERINFO properly set in CREAM jobwrapper
       • …
     • Configuration changes already communicated to M. Jouvin for Quattor QWG templates

  7. Batch system support status
     • Batch system support is coordinated by NIKHEF/SA3
     • A batch system xyz is supported in the CREAM CE
       • When xyz is supported in BLAH
       • When glite-xyz-utils is provided
         • Apel
         • Information providers
         • Configuration (yaim-xyz)
     • Torque/PBS
       • Supported in gLite 3.1 and gLite 3.2
       • Also new blparser available with CREAM CE 1.6
     • LSF
       • Supported in gLite 3.1
       • Supported in gLite 3.2 when glite-lsf-utils is released
         • Current status of relevant patch (#3403): ready for production
       • Also new blparser available with CREAM CE 1.6

  8. Batch system support status (cont.)
     • SGE
       • CESGA-LIP responsibility
       • Support in BLAH provided with CREAM CE 1.6
       • glite-sge-utils provided with patch #3764
         • Current status: ready for roll-out
       • A couple of sites already trying it
         • Uni-Muenchen, LIP
       • Only new blparser
     • Condor
       • PIC responsibility
       • Support in BLAH has been in place for a while
       • glite-condor-utils still missing in gLite 3.2
         • According to Pau Tallada (PIC) they are close to finalizing it
         • In the meantime CREAM with Condor is possible with some manual configuration
           • E.g. this is what was done at UCSD (OSG), where they use Condor as batch system
       • Only new blparser
     • BQS
       • “Running in production and available to the 4 LHC VOs” (S. Reynaud)

  9. WMS → CREAM
     • Main issue is detection of job status changes by the ICE component of the WMS
       • Job status changes in some cases are not detected by ICE (bug #61405)
         • → Job finished, but reported as Running wrt WMS/LB
       • Job status changes might be detected late
         • Current approach based on CEMon notification + polling doesn’t scale
     • Problems addressed in WMS 3.2.14 (patch #3621) for gLite 3.1
       • Bug fixes and use of the new QueryEvent operation provided by CREAM CE 1.6
       • Our tests look promising wrt ICE (see next slide)
       • In some tests the main bottleneck now appears to be the LB, which in some cases (in particular when the LB DB is not properly configured) is not able to sustain a high submission rate
     • Stuck in finalizing and certifying patch #3621
       • Metapackage preparation, etc.
       • Still not possible to certify patches for gLite 3.1
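
The difference between polling every job and asking only for new events can be sketched as follows. This is a toy model under stated assumptions, not the real QueryEvent interface: the slides do not give its signature, so `FakeCE`, `query_events`, `sync` and the idea of monotonically increasing event ids are hypothetical names and assumptions introduced for the example.

```python
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    event_id: int   # assumed to increase monotonically on the CE
    job_id: str
    status: str


class FakeCE:
    """In-memory stand-in for a CE that logs job status change events."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def record(self, job_id: str, status: str) -> None:
        self._events.append(Event(len(self._events) + 1, job_id, status))

    def query_events(self, after_id: int, max_events: int = 100) -> List[Event]:
        # Hypothetical query: return only events newer than after_id.
        newer = [e for e in self._events if e.event_id > after_id]
        return newer[:max_events]


def sync(ce: FakeCE, last_seen: int, statuses: Dict[str, str]) -> int:
    """Apply all events newer than last_seen; return the new last-seen id."""
    for event in ce.query_events(last_seen):
        statuses[event.job_id] = event.status
        last_seen = event.event_id
    return last_seen
```

With this pattern the cost of one synchronization depends on the number of new events rather than on the number of jobs being tracked, which is the scalability point made on the slide.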

  10. Job status changes detection by ICE

  11. Interoperability
     • New ICE (WMS 3.2.14) → New CREAM CE (CREAM 1.6)
       • Job status changes supposed to be all detected, and much more promptly than now
     • New ICE (WMS 3.2.14) → Old CREAM CE (CREAM < 1.6)
       • Job status changes supposed to be all detected if there is a valid proxy around (i.e. no more bug #61405)
       • But there could still be problems in promptly detecting job status changes
     • Old ICE (the one in production now) → New CREAM CE (CREAM 1.6)
       • Working not worse than now

  12. Condor → CREAM
     • Issues
       • Lease not renewed
         • Confirmed by Condor people that the problem is on the Condor side
         • They are going to provide a fix in ~ 2 weeks
       • Problems with proxy renewals
         • Noticed that the proxy renewal is done very (too) often by Condor
           • Still not fully clear why
         • Proxy renewal not very efficient in CREAM when many jobs use the same delegationid
           • CREAM can take too long to satisfy such requests and the queue of commands to be executed in CREAM can grow too much
           • Problem fixed in CREAM CE 1.6
       • Overload of the gridftp server on the Condor host when many short jobs start together
         • Condor is going to use a different approach for sandbox management (see next slide)
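
One generic way to keep a queue of renewal requests from growing when many jobs share a delegationid is to coalesce pending requests per delegation. The sketch below only illustrates that idea; the slides do not say how CREAM CE 1.6 fixes the problem, and the function name and data shapes are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def coalesce_renewals(requests: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Collapse pending (delegation_id, proxy) requests to one per delegation.

    Requests are given in arrival order. For each delegation id only the
    most recent proxy is kept; ids stay in order of first arrival.
    """
    latest: Dict[str, str] = {}
    for delegation_id, proxy in requests:
        # Overwriting keeps the key's original position in the dict.
        latest[delegation_id] = proxy
    return list(latest.items())
```

For example, three queued requests for two delegations reduce to two renewals, however many jobs reference each delegation.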

  13. Sandbox transferring
     • Condor (and WMS) now uses 1
       • File transfers 1 done when job starts running
       • Gridftpd on Condor host can get overloaded when many jobs start running together
     • Going to move to 2
       • 2a done when job is submitted
       • 2b done when job starts running
     • OSG is also asking to use batch system staging facilities instead of gridftp for 2b
       • Likely appropriate (only) if Condor is used as batch system
       • With e.g. Torque (when ssh is used) I am afraid it will make things worse
       • → It will be configurable
     [Diagram labels: Condor submitting host, CREAM CE, WN, Job; transfer paths 1, 2a, 2b]

  14. Output Sandbox
     • Right now in the JDL it is necessary to specify where (which gridftp/https server) the OSB must be staged
     • LHCB has very recently (last week) asked for a different approach
       • Possibility to store the OSB files in the CREAM CE
       • Possibility to then retrieve them using a glite-ce-job-output command
       • This was also discussed with Alice some time ago, but the outcome was to use gridftp servers on their VOBOXes
       • Also M. Jouvin raised the issue recently
     • This can be done, but requires some work
       • Requires also some “space management” in the CREAM CE
         • Enforcing a max sandbox size, purging old sandboxes when the free disk space is getting too low
     • Still to understand how critical this request is, in order to evaluate how the current plans must be modified
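
The “space management” mentioned on this slide could look roughly like the following. It is a sketch of the two policies named there (a size cap and purging of old sandboxes when free space runs low), not a design taken from CREAM: the threshold values, the oldest-first purge order and all names are assumptions.

```python
from typing import List, NamedTuple


class Sandbox(NamedTuple):
    job_id: str
    size_mb: int
    age_hours: float


MAX_SANDBOX_MB = 100   # hypothetical per-sandbox cap
MIN_FREE_MB = 1000     # hypothetical free-space floor


def accept_sandbox(size_mb: int, max_mb: int = MAX_SANDBOX_MB) -> bool:
    """Enforce the maximum output sandbox size."""
    return size_mb <= max_mb


def select_for_purge(sandboxes: List[Sandbox], free_mb: int,
                     min_free_mb: int = MIN_FREE_MB) -> List[str]:
    """Pick sandboxes to delete, oldest first, until free space is restored."""
    to_purge = []
    for sb in sorted(sandboxes, key=lambda s: s.age_hours, reverse=True):
        if free_mb >= min_free_mb:
            break
        to_purge.append(sb.job_id)
        free_mb += sb.size_mb   # space reclaimed by removing this sandbox
    return to_purge
```

The function only selects candidates; actually deleting files, and deciding whether a sandbox that has not yet been retrieved may be purged at all, is left open here as it is on the slide.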

  15. Other issues
     • LCG-CE coupled with CREAM-CE
       • Having a cluster used by a gLite 3.1/sl4 LCG CE and by a gLite 3.2/sl5 CREAM CE is a common use case
       • Open issue for Torque: maui client/server mismatch (bug #61698)
         • Unfortunately this is not in our domain
         • Documented a workaround (suggested in the LCG-ROLLOUT mailing list) in the CREAM known issues page (http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues)
     • Some recent requests by LHCB
       • Define the CREAMjobid in the env of the job
         • Needed for some DIRAC monitoring issues
         • Easy to be done
         • To be provided with CREAM CE 1.6 (since we have to rebuild it for the trustmanager issue)

  16. What’s next (current plan)?
     • Certification of CREAM CE 1.6 client for gLite 3.2/sl5_x86_64 (patch #3671): on-going
       • Support of QueryEvent operation
       • Some minor fixes
       • Certification supposed to be quite fast
     • Certification of CREAM CE 1.6 (server side) for gLite 3.1/sl4_i386 (patch #3898):
       • Not possible now (as for all patches for gLite 3.1)
       • Wrt CREAM, same software as the one used for gLite 3.2 (i.e. ready)
       • … but built and run against different versions of other software components
     • Certification of CREAM CE 1.6 client for gLite 3.1/sl4
       • Not possible now (as for all patches for gLite 3.1)
     • New ICE for gLite 3.1 (provided with WMS 3.2.14): patch #3621
       • Certification not possible now (as for all patches for gLite 3.1)
       • CREAM CE 1.6 client must be certified first
     • CREAM CE 1.7

  17. CREAM CE v. 1.7
     • Integration with Argus
       • Argus used to decide if a certain operation on a CREAM CE is authorized
       • Also used to get the local user id
     • Single AuthZ system in the CREAM CE
       • Now there is an AuthZ layer in CREAM, LCAS/LCMAPS for glexec, LCAS/LCMAPS for gridftpd
         • Because of bugs/misconfigurations, inconsistent authZ decisions could be taken
       • Also gridftpd integrated with Argus
     • Not using glexec (and dependencies) anymore
     • Initially the old code will be maintained
       • At configuration time it is possible to decide if Argus or the “old system” has to be used
     • Some work still to be done also on the Argus side
       • Bug fixes
     • Unlikely to be finalized by the end of EGEE-III
