
AliEn – LCG Interface



Presentation Transcript


  1. AliEn – LCG Interface: Status report
     [Architecture diagram. Labels: Job submission, Server, Interface Site, EDG RB, EDG Site, AliEn CE, EDG CE, EDG UI, EDG SE, AliEn SE, WN, AliEn, Data Registration, Data Catalogue, Replica Catalogue, LFN, PFN, LFN=PFN]

  2. AliEn – LCG Interface
     • Jobs are forwarded from the AliEn queue to one (or more) LCG-2 Resource Brokers by one (or more) dedicated interface nodes: LCG User Interfaces that also run the AliEn suite (CE, SE, FTD and Cluster Monitor). The whole LCG infrastructure is thus seen as a single, large AliEn CE.
     • Remote AliEn and AliRoot installation OK on all LCG sites available last Thursday (after some trivial management issues).
     • Job management interface works with no real problems (see next slide). Medium-scale tests OK, large-scale tests under way.
     • No reliable SE is available on the LCG production infrastructure; generated data is always moved to CERN CASTOR as soon as the job finishes, using AliEn tools (AIOd).
     • An interface to LCG storage is available anyway, and it will be tested as soon as LCG provides storage support on the EIS testbed.

  3. First event round on LCG
     • 480 jobs (100 cent1, 380 per5) were submitted through AliEn on Friday evening, of which 476 reached the interface and were submitted to the LCG RB. Only 5 were aborted for LCG-related issues.
     • The AliEn central server is being tuned for the production, so many jobs (including, unfortunately, most of the cent1 jobs) crashed when the server (and in some cases the gateway node services) was restarted. Most of the jobs apparently ran correctly, but AliEn lost contact with them. 157 reached the end.
     • 161 jobs were executed at RAL, 161 in Taiwan, 3 at CNAF, 104 at CERN, 39 in Karlsruhe and 3 at NIKHEF (all of which, however, were aborted).
     • The uneven distribution of the jobs across centres was due to a wrong config file on the interface machine: the default ranking was not other.GlueCEStateFreeCPUs but the infamous -other.GlueCEStateEstimatedResponseTime.
     • On Sunday morning, 250 more jobs (150 per5, 100 cent1) were submitted; at 17:00 all of the per5 jobs but one had completed correctly (53 at CERN, 17 at FZK, 33 at RAL and 48 in Taiwan) and all of the cent1 jobs were running. On Monday, all of the AliEn services were stopped for upgrading. The 31 cent1 jobs that had not yet finished were killed (see summary on next slide).
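The per-site figures quoted on slide 3 can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: it assumes that the six per-site counts cover exactly the 476 submitted jobs minus the 5 aborted for LCG-related issues, which the slide does not state explicitly. The function name `site_shares` is introduced here for illustration.

```python
# Per-site job counts for the first round, as quoted on the slide.
EXECUTED = {
    "RAL": 161,
    "Taiwan": 161,
    "CNAF": 3,
    "CERN": 104,
    "Karlsruhe": 39,
    "NIKHEF": 3,
}

SUBMITTED_TO_RB = 476  # jobs that reached the interface and went to the LCG RB
ABORTED_BY_LCG = 5     # jobs aborted for LCG-related issues


def site_shares(executed):
    """Return each site's fraction of all executed jobs (hypothetical helper)."""
    total = sum(executed.values())
    return {site: count / total for site, count in executed.items()}


total_executed = sum(EXECUTED.values())
# Assumption being checked: per-site counts = submitted - aborted by LCG.
consistent = total_executed == SUBMITTED_TO_RB - ABORTED_BY_LCG

shares = site_shares(EXECUTED)
for site, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{site:10s} {EXECUTED[site]:4d} {share:6.1%}")
```

Under this reading, RAL and Taiwan together account for roughly two thirds of the executed jobs, which is the uneven distribution the slide attributes to the ranking expression.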

  4. Very first event round on LCG
     • OK: as reported by AliEn. Output transferred to CERN CASTOR and registered in the AliEn Data Catalogue.
     • Aborted by LCG: reported as "Aborted" by the LB.
     • Zombie: contact lost between AliEn and the job. All due to server and gateway restarts; many probably finished correctly on LCG.
     • Aborted by AliEn: failed. Many due to server and gateway problems that have since been fixed.
     • Killed: killed by AliEn when it was stopped for upgrades on Monday, April 1st.
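The five categories on slide 4 amount to a small decision rule combining what AliEn knows about a job with what the LCG Logging and Bookkeeping service (LB) reports. The sketch below is one way to express that rule. The status strings (`"DONE"`, `"FAILED"`, `"KILLED"`, `None` for lost contact) and the order in which the checks are applied are assumptions made for illustration; only the LB value `"Aborted"` is taken from the slide.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Outcome(Enum):
    OK = "OK"
    ABORTED_BY_LCG = "Aborted by LCG"
    ZOMBIE = "Zombie"
    ABORTED_BY_ALIEN = "Aborted by AliEn"
    KILLED = "Killed"


def classify(alien_status: Optional[str], lb_status: str) -> Outcome:
    """Map a job's AliEn and LB status to one of the slide's categories.

    alien_status is a hypothetical label: "DONE", "FAILED", "KILLED",
    or None when AliEn has lost contact with the job.
    """
    if lb_status == "Aborted":       # LCG itself gave up on the job
        return Outcome.ABORTED_BY_LCG
    if alien_status is None:         # no contact: may still have finished on LCG
        return Outcome.ZOMBIE
    if alien_status == "KILLED":
        return Outcome.KILLED
    if alien_status == "FAILED":
        return Outcome.ABORTED_BY_ALIEN
    if alien_status == "DONE":
        return Outcome.OK
    raise ValueError(f"unknown AliEn status: {alien_status!r}")


# Made-up sample input, only to show how a summary would be tallied.
sample = [("DONE", "Done"), (None, "Running"), ("FAILED", "Done"), ("DONE", "Aborted")]
summary = Counter(classify(alien, lb) for alien, lb in sample)
```

Checking the LB status first is a design choice: a job that LCG aborted is counted against LCG even if AliEn also lost contact with it.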

  5. Real production starting on LCG
     1200 central jobs executed since the beginning.
     • Most of the jobs were executed by CNAF and RAL, with CERN and FZK following.
     • Taiwan problem: AliRoot installation missing, under investigation.
     • NIKHEF problem: AliEn installation disappeared.
     • Ranking problem: the default ranking -other.GlueCEStateEstimatedResponseTime does not work, so use other.GlueCEStateFreeCPUs (with "corrections").
     • CNAF problem: it does not correctly report the number of available CPUs and running jobs, so it will keep on accepting jobs forever.
     • CERN, RAL, FZK (and CNAF, apart from the ranking problem) OK.
     • Another interface machine in Torino will soon start submitting jobs to grid.it.
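The CNAF bullet on slide 5 describes a general weakness of ranking on published state: a site whose numbers never change keeps looking attractive. The toy model below illustrates this for a ranking equivalent to other.GlueCEStateFreeCPUs. It is not the Resource Broker's actual matchmaking algorithm; the `CE` class, the `updates_state` flag and the `dispatch` function are hypothetical names introduced for this sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CE:
    name: str
    free_cpus: int              # published value, stands in for GlueCEStateFreeCPUs
    updates_state: bool = True  # False models a site whose published numbers never change


def dispatch(ces, n_jobs):
    """Send jobs one by one to the CE with the most published free CPUs.

    Mutates the CE objects: a well-behaved CE publishes one CPU fewer
    after each job it receives; a stale CE keeps its published value.
    """
    sent = Counter()
    for _ in range(n_jobs):
        best = max(ces, key=lambda ce: ce.free_cpus)  # ties go to the first CE
        sent[best.name] += 1
        if best.updates_state and best.free_cpus > 0:
            best.free_cpus -= 1
    return sent


# Two well-behaved sites share the load evenly.
healthy = dispatch([CE("A", 10), CE("B", 10)], 20)

# One site never updates its published state and attracts almost everything.
skewed = dispatch([CE("A", 10), CE("stale", 10, updates_state=False)], 20)
```

In the second run the stale site receives every job after the first one, which is the "keeps on accepting jobs forever" behaviour. The "corrections" mentioned on the slide are not specified; a per-site cap on queued jobs would be one possible form.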

  6. To Do List
     • Better bookkeeping infrastructure: it is still awkward, for example, to get site statistics.
     • Automatic installation and upgrade: installation (done by sending a special job to the site) should be triggered automatically by AliEn (problem with NIKHEF).
     • Deploy another interface to LCG to share the load and provide redundancy. How many concurrent jobs can one interface withstand?
     • Deploy a test interface to the EIS testbed.
     • Test the LCG storage interface: not available on the LCG production infrastructure, but ready to test on the EIS testbed.
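As an illustration of the site statistics the first to-do item asks for, the sketch below aggregates per-job records into per-site totals and success fractions. The record format, a `(site, outcome)` pair with `"OK"` marking success, is an assumption; the slides do not describe how AliEn or LCG expose this information, and the sample records are invented.

```python
from collections import defaultdict


def site_statistics(records):
    """Aggregate (site, outcome) pairs into per-site totals and OK fractions."""
    counts = defaultdict(lambda: {"total": 0, "ok": 0})
    for site, outcome in records:
        counts[site]["total"] += 1
        if outcome == "OK":
            counts[site]["ok"] += 1
    return {
        site: {**c, "ok_fraction": c["ok"] / c["total"]}
        for site, c in counts.items()
    }


# Invented records, for illustration only.
records = [
    ("CERN", "OK"),
    ("CERN", "Zombie"),
    ("RAL", "OK"),
    ("RAL", "OK"),
    ("NIKHEF", "Aborted by LCG"),
]
stats = site_statistics(records)
```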
