Challenges and Success of HEP GRID
Faïrouz Malek, CNRS
3rd EGEE User Forum 2008, Clermont-Ferrand
High Energy Physics machines and detectors
• pp @ √s = 14 TeV, L: 10^34 /cm^2/s
• L: 2×10^32 /cm^2/s
• Detector labels: muon chambers, tracker, calorimeter
• 2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50-100 Hz; 25 MB/sec digitized recording
• 40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/sec digitized recording
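The trigger figures above imply large rejection factors. A minimal sketch of that arithmetic, using the second set of rates on the slide; the function name and the choice of 1 GB = 1e9 bytes are assumptions made for illustration.

```python
# Rates quoted on the slide (second set of figures).
collision_rate_hz = 40e6   # 40 million collisions per second
lvl1_rate_hz = 1e3         # LVL1 output: 1 kHz
lvl3_rate_hz = 100.0       # LVL3 output: 100 Hz


def rejection_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """Return the number of input events per event kept by a trigger stage."""
    return rate_in_hz / rate_out_hz


lvl1_rejection = rejection_factor(collision_rate_hz, lvl1_rate_hz)
lvl3_rejection = rejection_factor(lvl1_rate_hz, lvl3_rate_hz)
overall_rejection = rejection_factor(collision_rate_hz, lvl3_rate_hz)

# Implied average event size if 0.1 to 1 GB/s is recorded at 100 Hz
# (assumption: 1 GB = 1e9 bytes, 1 MB = 1e6 bytes).
event_size_mb = [gb_per_s * 1e9 / lvl3_rate_hz / 1e6 for gb_per_s in (0.1, 1.0)]

print(f"LVL1 keeps 1 collision in {lvl1_rejection:,.0f}")
print(f"LVL3 keeps 1 LVL1 event in {lvl3_rejection:,.0f}")
print(f"Overall: 1 collision in {overall_rejection:,.0f} is recorded")
print(f"Implied event size: {event_size_mb[0]:.0f} to {event_size_mb[1]:.0f} MB")
```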
LHC: 4 experiments … ready!
• First beam expected in autumn 2008
Professor Vangelis, what are you expecting from the LHC?
• Quarks: u, c, t and d, s, b
• Leptons: νe, νμ, ντ and e, μ, τ
• Gauge bosons: g, γ, Z, W
• 1st, 2nd and 3rd generations
• H: Higgs (← CMS simulation)
Alas! … Hopefully? The SM is not so Standard, AND … Hmmmmm … maybe …
• Supersymmetry: a new world where each boson (e.g. the photon) or fermion (e.g. e-) has superpartner(s)
• New dimensions (of space) where only some particles can propagate → gravitons, new bosons …
• Towards string theory: gravitation is handled by quantum mechanics. This holds only with 10 or more dimensions of space-time (Calabi-Yau).
Physicists see online/offline TRUE (top) events at a running experiment, D0 at Fermilab
@ CERN: acquisition, first-pass reconstruction, storage, distribution
LHC computing: is it really a challenge?
• Signal/Background: 10^-9
• Data volume: high rate × large number of channels × 4 experiments → 15 petabytes of new data each year
• Compute power: event complexity × number of events × thousands of users → 60 k of (today's) fastest CPUs
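To put the yearly data volume in perspective, it can be converted into an average sustained rate. This is a rough sketch only: it assumes decimal petabytes and data taken evenly over a full calendar year, neither of which the slide states, and the equal split across experiments is purely illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # 31,536,000 s, ignoring leap years
PETABYTE = 1e15                      # decimal petabyte (assumption)

new_data_per_year_bytes = 15 * PETABYTE   # "15 PetaBytes of new data each year"

# Average sustained rate if the data arrived evenly over the whole year.
avg_rate_mb_per_s = new_data_per_year_bytes / SECONDS_PER_YEAR / 1e6
avg_volume_tb_per_day = new_data_per_year_bytes / 365 / 1e12

# Illustrative only: an equal share for each of the 4 experiments.
per_experiment_mb_per_s = avg_rate_mb_per_s / 4

print(f"Average rate: {avg_rate_mb_per_s:.0f} MB/s")
print(f"Average volume: {avg_volume_tb_per_day:.1f} TB/day")
print(f"Equal share per experiment: {per_experiment_mb_per_s:.0f} MB/s")
```

In practice the accelerator does not run all year, so peak rates would be higher than this average.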
Timeline: LHC Computing
• ATLAS (or CMS) requirements for the first year at design luminosity
• Estimates shown on the timeline: 7×10^7 MIPS, 1,900 TB disk; 55×10^7 MIPS (140 MSi2K), 70,000 TB disk; 10^7 MIPS, 100 TB disk
• Reviews: ATLAS & CMS CTP; "Hoffmann" Review; Computing TDRs
• Milestones: LHC approved; LHCb approved; ATLAS & CMS approved; ALICE approved; LHC start
Evolution of CPU Capacity at CERN
• Tape and disk requirements: more than 10 times what CERN can provide
• Costs in 2007 Swiss francs, including infrastructure costs (computer centre, power, cooling, ...) and physics tapes
• Accelerators shown: SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV)
Timeline: Grids
• US: GriPhyN, iVDGL, PPDG; GRID 3; OSG
• Europe: EU DataGrid; EGEE 1; EGEE 2; EGEE 3
• LCG 1; LCG 2; WLCG
• Activities: Data Challenges, Service Challenges, Cosmics, First physics
WLCG Collaboration
• The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 38 federations of smaller "Tier-2" centres; growing to ~40 countries; Grids: EGEE, OSG, Nordugrid
• Technical Design Reports: WLCG and the 4 experiments, June 2005
• Memorandum of Understanding (agreed in October 2005): guaranteed resources; quality of services (24/7, 4 h intervention); resources: 5-year forward look; target reliability and efficiency: 95%
Centers around the world form a Supercomputer
• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid project (WLCG)
• Inter-operation between Grids is working!
Available Infrastructure
• EGEE: ~250 sites, >45000 CPU
• OSG: ~15 sites for LHC, >10000 CPU
• ¼ of the resources are contributed by groups external to the project
• More than ~25 k simultaneous jobs
What about the Middleware?
• Security: Virtual Organization Management (VOMS), MyProxy
• Data management: File catalogue (LFC), File Transfer Service (FTS), Storage Element (SE), Storage Resource Management (SRM)
• Job management: Workload Management System (WMS), Logging and Bookkeeping (LB), Computing Element (CE), Worker Nodes (WN)
• Information System, Monitoring: BDII (Berkeley Database Information Index) and RGMA (Relational Grid Monitoring Architecture) aggregate service information from multiple Grid sites; now moved to SAM (Site Availability Monitoring)
• Monitoring & visualization (Griview, Dashboard, Gridmap, etc.)
GRID ANALYSIS TOOLS
• ATLAS: pathena/PANDA; GANGA together with gLite and Nordugrid
• CMS: CRAB together with gLite WMS and CondorG
• LHCb: GANGA together with DIRAC
• ALICE: Alien2, PROOF
GANGA
• User-friendly job submission tool
• Extensible thanks to its plugin system
• Support for several applications: Athena, AthenaMC (ATLAS); Gaudi, DaVinci (LHCb); others …
• Support for several backends: LSF, PBS, SGE, etc.; gLite WMS, Nordugrid, Condor; DIRAC, PANDA
• GANGA job building blocks
• Various interfaces: command line, IPython, GUI
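The idea of a job assembled from interchangeable building blocks, with backends supplied as plugins, can be sketched in a few lines. This is not GANGA's actual API: the registry, the `Job` class and the backend names `Local` and `Grid` are all hypothetical, chosen only to illustrate the plugin pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical plugin registry: backend name -> submit function.
BACKENDS: Dict[str, Callable[["Job"], str]] = {}


def register_backend(name: str):
    """Decorator that adds a submit function to the registry under `name`."""
    def decorator(func):
        BACKENDS[name] = func
        return func
    return decorator


@register_backend("Local")
def submit_local(job: "Job") -> str:
    return f"ran '{job.application}' locally"


@register_backend("Grid")
def submit_grid(job: "Job") -> str:
    return f"queued '{job.application}' on the grid"


@dataclass
class Job:
    """A job is a bundle of building blocks: application, backend, inputs."""
    application: str
    backend: str = "Local"
    inputs: List[str] = field(default_factory=list)

    def submit(self) -> str:
        # Look the backend up at submit time, so new plugins work unchanged.
        if self.backend not in BACKENDS:
            raise ValueError(f"unknown backend: {self.backend}")
        return BACKENDS[self.backend](self)


test_run = Job("analysis.py")                   # try it locally first
full_run = Job("analysis.py", backend="Grid")   # same job, different backend
print(test_run.submit())
print(full_run.submit())
```

The design point is that switching from a local test to a grid run changes one building block and leaves the rest of the job definition untouched.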
GANGA USAGE
• Users by experiment: ATLAS, LHCb, others
• In total 968 people since January, 579 of them in ATLAS
• Per month: ~275 users, 150 of them in ATLAS
ATLAS Strategy
• On the EGEE and Nordugrid infrastructures, ATLAS uses direct submission to the middleware using GANGA
• EGEE: LCG RB and gLite WMS
• Nordugrid: ARC middleware
• On OSG: the PANDA system, a pilot-based system that is also available at some EGEE sites
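A pilot-based system differs from direct submission in that a lightweight placeholder job lands on a worker node first and only then pulls real work from a central queue. The sketch below illustrates that pattern only; the function, the site names and the payload names are hypothetical and are not taken from PANDA.

```python
from collections import deque

# Central queue of real payloads waiting for a slot (hypothetical names).
task_queue = deque(["reco_001", "reco_002", "sim_003"])


def run_pilot(site, queue, healthy=True):
    """Simulate one pilot: check the node, then pull a payload if possible.

    Returns "payload@site" when work was picked up, otherwise None.
    """
    if not healthy:
        # The node failed the pilot's checks: exit without touching the
        # queue, so no real payload is wasted on a broken node.
        return None
    if not queue:
        return None  # nothing to do, the pilot simply exits
    payload = queue.popleft()
    return f"{payload}@{site}"


results = [
    run_pilot("SITE_A", task_queue),
    run_pilot("SITE_B", task_queue, healthy=False),
    run_pilot("SITE_C", task_queue),
]
print(results)
print(list(task_queue))
```

Because the payload is chosen only after the pilot is running, a misconfigured node costs a pilot rather than a user job.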
GANGA JOBS
• About 50K jobs since September
• Tier-1: 48% Lyon, 36% FZK
ATLAS PANDA System
• Interoperability is important
• PANDA jobs run on some EGEE sites
• PANDA is an additional backend for GANGA
• The positive aspect is that this gives ATLAS choices on how to evolve
CMS CRAB FEATURES
• CRAB: CMS Remote Analysis Builder
• User-oriented tool for grid submission and handling of analysis jobs
• Support for gLite WMS and CondorG
• Command-line-oriented tool
• Allows users to create and submit jobs, query their status and retrieve output
CMS CRAB usage
• Mid-July to mid-August 2007: 645K jobs (20K jobs/day), 89% grid success rate
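The CRAB figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers on the slide; treating "K" as exactly 1,000 and the quoted values as exact are assumptions.

```python
total_jobs = 645_000          # 645K jobs
jobs_per_day = 20_000         # 20K jobs/day
grid_success_rate = 0.89      # 89% grid success rate

# Length of the period implied by the total and the daily rate.
implied_days = total_jobs / jobs_per_day

# Jobs that succeeded and failed at the grid level.
successful_jobs = round(total_jobs * grid_success_rate)
failed_jobs = total_jobs - successful_jobs

print(f"Implied period: about {implied_days:.0f} days")
print(f"Successful jobs: {successful_jobs:,}")
print(f"Failed jobs: {failed_jobs:,}")
```

The implied period of roughly a month is consistent with the mid-July to mid-August window.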
OTHERS …
• LHCb: GANGA as user interface, DIRAC as backend
• ALICE: Alien2
• Alien and DIRAC are in many respects similar to PANDA
PROOF
• Diagram labels: PROOF cluster (master, scheduler, CPUs), file catalog, storage
• PROOF query: data file list, mySelector.C
• Returned to the user: feedback, merged final output
• Cluster perceived as an extension of the local PC
• Same macro and syntax as in a local session
• More dynamic use of resources
• Real-time feedback
• Automatic splitting and merging
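The split-and-merge pattern behind such a query can be sketched as follows. This is not PROOF code: the file names and data are made up, the split is a static round-robin rather than dynamic scheduling, and Python threads stand in for the cluster's workers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Fake "files": name -> event values. Stand-ins for real data files.
DATASET = {
    "run1.root": [1, 2, 2, 3],
    "run2.root": [2, 3, 3],
    "run3.root": [1, 1, 4],
    "run4.root": [4, 4, 2],
}


def split(files, n_workers):
    """Deal the file list out round-robin into n_workers packets."""
    return [files[i::n_workers] for i in range(n_workers)]


def select(packet):
    """Per-worker selection: histogram the values of the files in a packet."""
    hist = Counter()
    for name in packet:
        hist.update(DATASET[name])
    return hist


def merge(partials):
    """Add the partial histograms into the final output."""
    total = Counter()
    for hist in partials:
        total.update(hist)
    return total


def run_query(files, n_workers=2):
    """Split the file list, process packets in parallel, merge the results."""
    packets = split(files, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(select, packets))
    return merge(partials)


result = run_query(sorted(DATASET))
print(dict(sorted(result.items())))
```

The user writes only the selection; the merged result does not depend on how many workers the list was split across.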
Baseline Services
• The basic baseline services, from the TDR (2005):
• Storage Element: Castor, dCache, DPM (with SRM 1.1); StoRM added in 2007; SRM 2.2 being deployed in production after long delays
• Basic transfer tools: GridFTP, ...
• File Transfer Service (FTS)
• LCG File Catalog (LFC)
• LCG data management tools: lcg-utils
• POSIX I/O: Grid File Access Library (GFAL)
• Synchronised databases between T0 and T1s: 3D project
• Information System
• Compute Elements: Globus/Condor-C; web services (CREAM)
• gLite Workload Management: in production at CERN
• VO Management System (VOMS)
• VO Boxes
• Application software installation
• Job monitoring tools
• ... continuing evolution: reliability, performance, functionality, requirements
3D - Distributed Deployment of Databases for LCG
• ORACLE Streaming with Downstream Capture (ATLAS, LHCb)
• SQUID/FRONTIER web caching (CMS)
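The benefit of the web-caching approach is that many jobs at one site asking for the same database content generate a single request to the central database. A minimal sketch of that idea, with hypothetical class names and keys; it does not reproduce the FRONTIER or SQUID interfaces and ignores cache expiry.

```python
class CentralDatabase:
    """Stand-in for the central database; counts the queries it serves."""

    def __init__(self):
        self.queries = 0

    def lookup(self, key):
        self.queries += 1
        return f"payload-for-{key}"


class SiteCache:
    """Stand-in for a site-level cache placed in front of the database."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, key):
        # Only the first request for a key travels to the central database.
        if key not in self.store:
            self.store[key] = self.origin.lookup(key)
        return self.store[key]


db = CentralDatabase()
cache = SiteCache(db)

# 1000 jobs at the same site all need the same (hypothetical) calibration set.
answers = [cache.get("calibration/run-1234") for _ in range(1000)]

print(f"Requests from jobs: {len(answers)}")
print(f"Queries reaching the central database: {db.queries}")
```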
The usage
• The number of jobs
• The production
• The real success!
Site Reliability
• Tier-2 sites: 83 Tier-2 sites being monitored
GRID Production per VO in one year
• HEP: 33 million jobs, ~110 million Norm. CPU
HEP GRID Production in one year
• VOs labelled: ILC, …, Babar, D0
CMS simulation, 2nd term 2007
• ~675 Mevents
• Sites labelled: FNAL, CC-IN2P3, PIC
ATLAS: the data chain works (Sept 2007)
• Tracks recorded in the muon chambers of the ATLAS detector were dispatched to physicists all over the world, enabling simultaneous analysis at sites across the globe.
• About two million muons were recorded over two weeks.
• Terabytes of data were moved from the Tier-0 site at CERN to Tier-1 sites across Europe (seven sites), North America (one site in the U.S. and one in Canada) and Asia (one site in Taiwan).
• Data transfer rates reached the expected maximum.
• Real analysis (at Tier-2 sites) happened in quasi real time at sites across Europe and the U.S.
Ramp-up Needed for Start-up
• Chart series: target usage, usage, pledge, installed
• Time axis: Sep-06, Jul-07, Apr-08 (one panel: Jul-07, Sep-07, Apr-08)
• Ramp-up factors shown: 2.3×, 2.9×, 3.7×, 3×, 3.7×
The Grid is now in operation, working on: reliability, scaling up, sustainability
Summary
• Applications support in good shape
• WLCG service: baseline services in production, with the exception of SRM 2.2
• Continuously increasing capacity and workload
• General site reliability is improving, but still a concern
• Data and storage remain the weak points
• Experiment testing is progressing: it now involves most sites and is approaching full dress rehearsals
• Sites and experiments are working well together to tackle the problems
• Major Combined Computing Readiness Challenge, Feb-May 2008, before the machine starts: essential to provide experience for site operations and storage systems, stressed simultaneously by all four experiments
• Steep ramp-up ahead to deliver the capacity needed for the 2008 run
Improving Reliability
• Monitoring
• Metrics
• Workshops
• Data challenges
• Experience
• Systematic problem analysis
• Priority from software developers