
Challenges and Success of HEP GRID



  1. Challenges and Success of HEP GRID. Faïrouz Malek, CNRS. 3rd EGEE User Forum 2008, Clermont-Ferrand

  2. The scales

  3. High Energy Physics machines and detectors. LHC: pp at √s = 14 TeV, L = 10³⁴ cm⁻²s⁻¹; Tevatron: L = 2×10³² cm⁻²s⁻¹. [Detector cut-away labelled: muon chambers, tracker, calorimeter.] Tevatron: 2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50-100 Hz; 25 MB/sec digitized recording. LHC: 40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/sec digitized recording.
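As a quick sanity check on the recording rates quoted above (not from the slide itself), multiplying the post-trigger event rate by a plausible raw event size reproduces the quoted bandwidth; the event-size range below is an assumption for illustration.

```python
# Back-of-envelope check of the "0.1 to 1 GB/sec digitized recording" figure.
# The event sizes are assumed values, used only to show the arithmetic.
lvl3_rate_hz = 100            # events per second surviving the last trigger level
event_sizes_mb = (1, 10)      # assumed raw event size range, in MB

for size_mb in event_sizes_mb:
    rate_gb_s = lvl3_rate_hz * size_mb / 1000.0
    print(f"{lvl3_rate_hz} Hz x {size_mb} MB/event = {rate_gb_s:.1f} GB/s")
# Prints 0.1 GB/s and 1.0 GB/s, bracketing the quoted recording rate.
```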

  4. LHC: 4 experiments … ready! First beam expected in autumn 2008

  5. Professor Vangelis, what are you expecting from the LHC? [Standard Model chart: quarks u, c, t and d, s, b; leptons νe, νμ, ντ and e, μ, τ; gauge bosons g, γ, Z, W; the Higgs H; arranged in 1st, 2nd and 3rd generations. Shown next to a CMS simulation event display.]

  6. Alas! … Hopefully? The SM is not so Standard, and … hmmm … maybe: Supersymmetry, a new world where each boson (e.g. the photon) or fermion (e.g. the electron) has super-partner(s); new (spatial) dimensions in which only some particles can propagate → gravitons, new bosons …; towards string theory, in which gravitation is handled by quantum mechanics, which works only with 10 or more dimensions of space-time (Calabi-Yau).

  7. Physicists see, online and offline, TRUE (top) events at a running experiment: D0 at Fermilab

  8. A collision @ LHC

  9. @ CERN: acquisition, first-pass reconstruction, storage, distribution

  10. The Data Acquisition

  11. LHC computing: is it really a challenge? • Signal/Background ~ 10⁻⁹ • Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year • Compute power: event complexity × number of events × thousands of users → 60k of today's fastest CPUs
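The 15 PB/year figure can be reproduced with a back-of-envelope estimate; the event rate, event size and effective running time below are illustrative assumptions rather than numbers from the slide.

```python
# Rough reconstruction of the "15 PetaBytes of new data each year" figure.
# All inputs are assumed, order-of-magnitude values.
experiments = 4
event_rate_hz = 200            # events written per second, per experiment
event_size_mb = 1.5            # average raw event size
seconds_per_year = 1.0e7       # effective LHC running time per year

raw_pb_per_year = experiments * event_rate_hz * event_size_mb * seconds_per_year / 1e9
print(f"raw data: ~{raw_pb_per_year:.0f} PB/year")   # ~12 PB/year of raw data
# Reconstructed and simulated data on top of this bring the total to roughly 15 PB.
```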

  12. Options as seen in 1996, before the Grid was invented

  13. Timeline of LHC computing. [Timeline figure: ATLAS (or CMS) requirements for the first year at design luminosity, estimated at successive moments as 7×10⁷ MIPS and 1,900 TB disk (140 MSi2K), 55×10⁷ MIPS and 70,000 TB disk, and 10⁷ MIPS and 100 TB disk, set against the milestones: LHC approved, ATLAS & CMS approved, ALICE approved, LHCb approved, ATLAS & CMS CTP, "Hoffmann" Review, Computing TDRs, LHC start.]

  14. Evolution of CPU capacity at CERN. Tape & disk requirements: more than 10 times what CERN can provide. Costs (2007 Swiss Francs) include infrastructure costs (computer centre, power, cooling, …) and physics tapes. [Plot annotated with CERN machines: SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV).]

  15. Timeline of the Grids. [Timeline figure: GriPhyN, iVDGL, PPDG → Grid3 → OSG; EU DataGrid → EGEE 1 → EGEE 2 → EGEE 3; LCG 1 → LCG 2 → WLCG; with Data Challenges, Service Challenges, Cosmics and First physics marked along the way.]

  16. The Tiers Model: Tier-0, Tier-1, Tier-2

  17. WLCG Collaboration • The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 38 federations of smaller “Tier-2” centres; growing to ~40 countries; Grids: EGEE, OSG, NorduGrid • Technical Design Reports: WLCG and the 4 experiments, June 2005 • Memorandum of Understanding (agreed in October 2005): guaranteed resources, quality of service (24/7, 4h intervention), 5-year forward look on resources • Target reliability and efficiency: 95%

  18. Centres around the world form a supercomputer • The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid project (WLCG) • Inter-operation between Grids is working!

  19. Available infrastructure • EGEE: ~250 sites, >45,000 CPUs • OSG: ~15 sites for the LHC, >10,000 CPUs • ¼ of the resources are contributed by groups external to the project • >25k simultaneous jobs

  20. What about the middleware? • Security: Virtual Organization Management (VOMS), MyProxy • Data management: File Catalogue (LFC), File Transfer Service (FTS), Storage Element (SE), Storage Resource Management (SRM) • Job management: Workload Management System (WMS), Logging and Bookkeeping (LB), Computing Element (CE), Worker Nodes (WN) • Information system and monitoring: BDII (Berkeley Database Information Index) and R-GMA (Relational Grid Monitoring Architecture) aggregate service information from multiple Grid sites, now moved to SAM (Site Availability Monitoring) • Monitoring & visualisation: GridView, Dashboard, GridMap, etc.
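To make the job-management part of this stack concrete, here is a minimal, hedged sketch of how a user job typically reaches the gLite WMS: a small JDL (Job Description Language) file describing the executable and sandboxes, handed to the WMS command-line tools. The payload script and file names are placeholders invented for the example.

```python
# Illustrative sketch only: write a minimal gLite JDL job description.
# The payload (run_analysis.sh) and all file names are placeholders.
jdl = """
Executable    = "run_analysis.sh";
Arguments     = "--events 1000";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {"run_analysis.sh"};
OutputSandbox = {"job.out", "job.err"};
Requirements  = other.GlueCEStateStatus == "Production";
"""

with open("analysis.jdl", "w") as f:
    f.write(jdl)

# Submission and follow-up then go through the WMS command-line tools, e.g.:
#   glite-wms-job-submit -a -o job_ids analysis.jdl
#   glite-wms-job-status -i job_ids
#   glite-wms-job-output -i job_ids
```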

  21. GRID ANALYSIS TOOLS • ATLAS: pathena/PANDA; GANGA together with gLite and NorduGrid • CMS: CRAB together with the gLite WMS and Condor-G • LHCb: GANGA together with DIRAC • ALICE: AliEn2, PROOF

  22. GANGA • User-friendly job submission tool • Extensible thanks to a plugin system • Support for several applications: Athena, AthenaMC (ATLAS); Gaudi, DaVinci (LHCb); others … • Support for several backends: LSF, PBS, SGE, etc.; gLite WMS, NorduGrid, Condor; DIRAC, PANDA • GANGA job building blocks • Various interfaces: command line, IPython, GUI
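As an illustration of the command-line/IPython interface, a minimal GANGA session could look like the sketch below. It assumes the GANGA prompt (where Job, Executable and the backend classes are predefined); the trivial payload and the choice of the LCG backend are assumptions for the example, not taken from the slides.

```python
# Minimal GANGA session sketch, typed at the GANGA/IPython prompt.
# It assembles a job from the standard building blocks: application + backend.
j = Job()                                        # a new, empty job object
j.application = Executable(exe='/bin/echo',
                           args=['hello grid'])  # placeholder payload
j.backend = LCG()                                # send it to the gLite/LCG grid;
                                                 # Local(), Dirac(), Panda() are other backends
j.submit()                                       # submit; progress is visible via jobs / j.status
```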

  23. GANGA USAGE [pie chart of users by experiment: ATLAS, LHCb, others] In total 968 persons since January, 579 in ATLAS; per month ~275 users, 150 in ATLAS

  24. ATLAS strategy • On the EGEE and NorduGrid infrastructures, ATLAS uses direct submission to the middleware through GANGA • EGEE: LCG RB and gLite WMS • NorduGrid: ARC middleware • On OSG: the PANDA system • A pilot-based system (see the sketch below) • Also available at some EGEE sites
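The "pilot-based" wording refers to late binding: the grid job that lands on a worker node is a generic pilot, which only fetches its real payload from a central task queue once it actually has a CPU. Below is a toy, self-contained sketch of that pattern; the queue contents and payload commands are invented, and the real PANDA protocol is HTTP-based and far richer.

```python
# Toy illustration of the pilot / late-binding pattern used by systems such as PANDA.
# The task queue and payload commands are invented for the example.
import queue
import subprocess

task_queue = queue.Queue()                    # stands in for the central task server
task_queue.put(["echo", "reconstruct run 1234"])
task_queue.put(["echo", "analyse dataset X"])

def pilot():
    """A pilot lands on a worker node, then pulls real work until none is left."""
    while True:
        try:
            payload = task_queue.get_nowait() # late binding: work is assigned only now
        except queue.Empty:
            return                            # nothing left: the pilot exits quietly
        subprocess.run(payload, check=True)   # execute the payload job

pilot()
```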

  25. GANGA JOBS About 50K jobs since September Tier-1: 48% Lyon, 36% FZK

  26. ATLAS Panda System • Interoperability is important • PANDA jobs on some EGEE sites • PANDA is an additional backend for GANGA • The positive aspect is that it gives ATLAS choices on how to evolve

  27. CMS CRAB FEATURES • CMS Remote Analysis Builder • User-oriented tool for grid submission and handling of analysis jobs • Support for the gLite WMS and Condor-G • Command-line oriented tool • Lets users create and submit jobs, query their status and retrieve output (see the sketch below)
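For comparison with the GANGA example above, a CRAB analysis of this era was steered by a small configuration file plus a handful of commands. The sketch below writes an illustrative crab.cfg from memory: the dataset, parameter-set name and the exact key and option spellings are assumptions that should be checked against the CRAB documentation.

```python
# Illustrative CRAB-style configuration, written from memory; all values are placeholders.
crab_cfg = """
[CRAB]
jobtype   = cmssw
scheduler = glite

[CMSSW]
datasetpath            = /ExampleDataset/Example-v1/RECO
pset                   = analysis_cfg.py
total_number_of_events = -1
events_per_job         = 1000

[USER]
return_data = 1
"""

with open("crab.cfg", "w") as f:
    f.write(crab_cfg)

# Typical lifecycle on the command line (option names as remembered, to be verified):
#   crab -create ; crab -submit ; crab -status ; crab -getoutput
```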

  28. CMS CRAB usage: mid-July to mid-August 2007, 645K jobs (20K jobs/day), 89% grid success rate

  29. OTHERS … • LHCb: GANGA as user interface, DIRAC as backend • ALICE: AliEn2 • AliEn and DIRAC are in many respects similar to PANDA

  30. PROOF [architecture diagram: the user sends a PROOF query (data file list + mySelector.C) to the master; the scheduler distributes the work over the cluster CPUs, input files are located via the file catalogue and storage, and feedback plus the merged final output come back to the user] • Cluster perceived as an extension of the local PC • Same macro and syntax as in a local session • More dynamic use of resources • Real-time feedback • Automatic splitting and merging
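Since the slide stresses "same macro and syntax as in a local session", a hedged PyROOT sketch of what such a PROOF query could look like is given below. It requires a ROOT installation with PROOF support; the master URL, tree name and input file are placeholders, and only mySelector.C comes from the slide.

```python
# Hedged PyROOT sketch of a PROOF query; requires ROOT with PROOF support.
# Master URL, tree name and file name are placeholders.
import ROOT

proof = ROOT.TProof.Open("proof-master.example.org")  # connect to the PROOF master

chain = ROOT.TChain("events")          # same TChain as in a local session
chain.Add("data/run1234.root")         # the data file list of the query
chain.SetProof()                       # route Process() through the PROOF cluster
chain.Process("mySelector.C+")         # the selector named on the slide; '+' compiles it

# Without the SetProof() call, Process() would run the same selector locally,
# which is exactly the "same macro and syntax" point made on the slide.
```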

  31. Baseline services (the basic baseline services from the TDR, 2005 … continuing evolution: reliability, performance, functionality, requirements) • Storage Element: Castor, dCache, DPM (with SRM 1.1); StoRM added in 2007; SRM 2.2, long delays incurred, being deployed in production • Basic transfer tools: GridFTP, … • File Transfer Service (FTS) • LCG File Catalog (LFC) • LCG data management tools: lcg-utils • POSIX I/O: Grid File Access Library (GFAL) • Synchronised databases T0 → T1s: 3D project • Information System • Compute Elements: Globus/Condor-C, web services (CREAM) • gLite Workload Management, in production at CERN • VO Management System (VOMS) • VO Boxes • Application software installation • Job monitoring tools
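To show how the LFC, the SRM-based storage elements and lcg-utils fit together from a user's point of view, here is a hedged sketch; the VO, logical file name, storage element and local paths are invented placeholders, and the option spellings should be verified against the installed lcg-utils version.

```python
# Hedged sketch of user-level data management with lcg-utils (placeholders throughout).
# lcg-cr: copy a local file to a Storage Element and register it in the LFC catalogue.
# lcg-cp: copy a catalogued file back to local disk.
import subprocess

VO = "atlas"                                        # assumed VO
LFN = "lfn:/grid/atlas/user/example/ntuple.root"    # assumed logical file name
SE = "srm.example-t2.org"                           # assumed storage element

subprocess.run(["lcg-cr", "--vo", VO, "-d", SE, "-l", LFN,
                "file:///tmp/ntuple.root"], check=True)        # upload + register
subprocess.run(["lcg-cp", "--vo", VO, LFN,
                "file:///tmp/ntuple_copy.root"], check=True)   # retrieve a copy
```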

  32. 3D: Distributed Deployment of Databases for LCG • Oracle Streams with downstream capture (ATLAS, LHCb) • Squid/Frontier web caching (CMS)

  33. LHCOPN Architecture

  34. The usage • The number of jobs • The production • The real success !!!!

  35. Data Transfer out of Tier-0

  36. Site reliability

  37. Site reliability: Tier-2 sites; 83 Tier-2 sites being monitored

  38. Grid production per VO in one year: HEP 33 million jobs, ~110 million Norm. CPU

  39. HEP Grid production in one year [chart of contributing experiments, including ILC, …, BaBar, D0]

  40. CMS simulation, 2nd term of 2007: ~675 Mevents [chart of contributing sites, including FNAL, CC-IN2P3, PIC]

  41. ATLAS: the data chain works (Sept 2007). Tracks recorded in the muon chambers of the ATLAS detector were expressed to physicists all over the world, enabling simultaneous analysis at sites across the globe. About two million muons were recorded over two weeks. Terabytes of data were moved from the Tier-0 site at CERN to Tier-1 sites across Europe (seven sites), America (one site in the U.S. and one in Canada) and Asia (one site in Taiwan). Data transfer rates reached the expected maximum. Real analysis (in Tier-2s) happened in quasi real-time at sites across Europe and the U.S.

  42. Ramp-up needed for start-up. [Plots of installed and pledged capacity versus usage and target usage, from Sep 2006 through Jul 2007 to Apr 2008; required growth factors of 2.3×, 2.9×, 3× and 3.7× are indicated.]

  43. The Grid is now in operation, working on: reliability, scaling up, sustainability

  44. Summary • Applications support is in good shape • WLCG service: baseline services in production, with the exception of SRM 2.2 • Continuously increasing capacity and workload • General site reliability is improving, but is still a concern • Data and storage remain the weak points • Experiment testing is progressing, now involving most sites and approaching full dress rehearsals • Sites & experiments are working well together to tackle the problems • Major Combined Computing Readiness Challenge, Feb-May 2008, before the machine starts: essential to provide experience for site operations and storage systems, stressed simultaneously by all four experiments • Steep ramp-up ahead to deliver the capacity needed for the 2008 run

  45. Improving reliability: monitoring, metrics, workshops, data challenges, experience, systematic problem analysis, priority from software developers
