Challenges and Success of HEP GRID

Challenges and Success of HEP GRID. Faïrouz Malek, CNRS. 3rd EGEE User Forum 2008, Clermont-Ferrand

The scales 2

High Energy Physics machines and detectors
• pp @ √s = 14 TeV, L = 10^34 /cm²/s: 40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/sec digitized recording
• L = 2×10^32 /cm²/s: 2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50-100 Hz; 25 MB/sec digitized recording
• Detector labels: muon chambers, tracker, calorimeter
3
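The trigger figures quoted on this slide imply large rejection factors, which a few lines of arithmetic make explicit. The sketch below is only illustrative: the function name `trigger_summary` is hypothetical, a gigabyte is taken as 10^9 bytes, and it assumes every event accepted at LVL3 is recorded.

```python
def trigger_summary(collision_hz, lvl1_hz, lvl3_hz, recording_bytes_per_s):
    """Return rejection factors and the implied size of one recorded event."""
    return {
        "lvl1_rejection": collision_hz / lvl1_hz,      # collisions per LVL1 accept
        "lvl3_rejection": lvl1_hz / lvl3_hz,           # LVL1 accepts per LVL3 accept
        "overall_rejection": collision_hz / lvl3_hz,   # collisions per recorded event
        "bytes_per_event": recording_bytes_per_s / lvl3_hz,
    }

# Slide figures: 40 million collisions/s, LVL1 1 kHz, LVL3 100 Hz,
# and the low end (0.1 GB/s) of the quoted recording range.
summary = trigger_summary(40_000_000, 1_000, 100, 100_000_000)
for name, value in summary.items():
    print(f"{name}: {value:,.0f}")
```

With these inputs only one collision in 400,000 is kept, and each recorded event is about 1 MB (about 10 MB at the 1 GB/sec end of the range).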

LHC: 4 experiments … ready! First beam expected in autumn 2008 4

Professor Vangelis, what are you expecting from the LHC?
• Quarks (1st, 2nd, 3rd generation): u, c, t and d, s, b
• Leptons (1st, 2nd, 3rd generation): νe, νμ, ντ and e, μ, τ
• Gauge bosons: g, γ, Z, W
• Higgs: H (← CMS simulation)
5

Alas! … Hopefully? The Standard Model is not so Standard, AND … hmmm … maybe:
• Supersymmetry: a new world where each boson (e.g. the photon) or fermion (e.g. e-) has super partner(s)
• New (space) dimensions in which only some particles can propagate → gravitons, new bosons …
• Towards String Theory: gravitation is handled by quantum mechanics. This holds only with 10 or more space-time dimensions (Calabi-Yau).
6

Physicists see online/offline TRUE (top) events @ a running D0/Fermilab experiment 7

A collision @ LHC 8

@ CERN: Acquisition, First pass reconstruction, Storage, Distribution 9

The Data Acquisition 10

LHC computing: is it really a challenge?
• Signal/Background: 10^-9
• Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
• Compute power: event complexity × number of events × thousands of users → 60 k of (today's) fastest CPUs
11
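To put the 15 PetaBytes per year in perspective, it can be converted to an average sustained rate. This is a rough sketch under stated assumptions: a petabyte is taken as 10^15 bytes and the volume is spread uniformly over a 365-day calendar year, which understates the real peak rates because the accelerator only runs for part of the year. The helper name `average_rate_mb_per_s` is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # calendar year, assumed uniform data taking
PETABYTE = 10**15                    # decimal petabyte (assumption)
MEGABYTE = 10**6

def average_rate_mb_per_s(petabytes_per_year):
    """Average rate in MB/s needed to absorb a yearly data volume."""
    return petabytes_per_year * PETABYTE / SECONDS_PER_YEAR / MEGABYTE

rate = average_rate_mb_per_s(15)
print(f"{rate:.0f} MB/s averaged over the year")
```

The result is a little under 0.5 GB/s, sustained around the clock.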

Options as seen in 1996, before the GRID was invented 12

Timeline LHC Computing
• ATLAS (or CMS) requirements for first year at design luminosity:
  • ATLAS & CMS CTP: 10^7 MIPS, 100 TB disk
  • “Hoffmann” Review: 7×10^7 MIPS, 1,900 TB disk
  • Computing TDRs: 55×10^7 MIPS (140 MSi2K), 70,000 TB disk
• Milestones on the timeline: LHC approved, ATLAS & CMS approved, ALICE approved, LHCb approved, LHC start
13

Evolution of CPU Capacity at CERN
• Tape & disk requirements: more than 10 times what CERN alone can provide
• Costs (2007 Swiss Francs) include infrastructure costs (computer centre, power, cooling, ..) and physics tapes
• Accelerators on the chart: SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV)
14

Timeline Grids
• US: GriPhyN, iVDGL, PPDG → GRID 3 → OSG
• Europe: EU DataGrid → EGEE 1 → EGEE 2 → EGEE 3
• LHC: LCG 1 → LCG 2 → WLCG
• Activities: Data Challenges, Service Challenges, Cosmics, First physics
15

The Tiers Model: Tier-0, Tier-1, Tier-2 16

WLCG Collaboration
• The Collaboration
  • 4 LHC experiments
  • ~250 computing centres
  • 12 large centres (Tier-0, Tier-1)
  • 38 federations of smaller “Tier-2” centres
  • Growing to ~40 countries
  • Grids: EGEE, OSG, Nordugrid
• Technical Design Reports
  • WLCG, 4 experiments: June 2005
• Memorandum of Understanding (agreed in October 2005)
  • Guaranteed resources
  • Quality of services (24/7, 4h intervention)
  • Resources: 5-year forward look
  • Target reliability and efficiency: 95%
17
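A 95% reliability target is easier to reason about with numbers. The sketch below assumes one common convention that the slide itself does not spell out: availability is uptime over total time, while reliability excludes scheduled downtime from the denominator. The function names and the sample hours are hypothetical.

```python
def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site was up."""
    return up_hours / total_hours

def reliability(up_hours, total_hours, scheduled_down_hours):
    """Like availability, but scheduled downtime is not held against the site."""
    return up_hours / (total_hours - scheduled_down_hours)

TARGET = 0.95
MONTH_HOURS = 30 * 24   # 720 hours in a 30-day month

# Hypothetical site: 670 hours up, 20 hours of announced maintenance.
site_availability = availability(670, MONTH_HOURS)
site_reliability = reliability(670, MONTH_HOURS, 20)
print(f"availability {site_availability:.1%}, reliability {site_reliability:.1%}")
print("meets target:", site_reliability >= TARGET)
```

Under this convention the example site misses 95% on availability but meets it on reliability, so the choice of metric matters when comparing sites.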

Centers around the world form a Supercomputer • The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project (WLCG) • Inter-operation between Grids is working! 18

Available Infrastructure
• EGEE: ~250 sites, >45000 CPU
• OSG: ~15 sites for LHC, >10000 CPU
• ¼ of the resources are contributed by groups external to the project
• More than ~25 k simultaneous jobs
19

What about the Middleware?
• Security
  • Virtual Organization Management (VOMS)
  • MyProxy
• Data management
  • File catalogue (LFC)
  • File transfer service (FTS)
  • Storage Element (SE)
  • Storage Resource Management (SRM)
• Job management
  • Workload Management System (WMS)
  • Logging and Bookkeeping (LB)
  • Computing Element (CE)
  • Worker Nodes (WN)
• Information System
  • Monitoring: BDII (Berkeley Database Information Index) and R-GMA (Relational Grid Monitoring Architecture) → aggregate service information from multiple Grid sites, now moved to SAM (Site Availability Monitoring)
  • Monitoring & visualization (GridView, Dashboard, GridMap etc.)
20

GRID ANALYSIS TOOLS
• ATLAS
  • pathena/PANDA
  • GANGA together with gLite and Nordugrid
• CMS
  • CRAB together with gLite WMS and CondorG
• LHCb
  • GANGA together with DIRAC
• ALICE
  • AliEn2, PROOF
21

GANGA
• User-friendly job submission tools
• Extensible thanks to a plugin system
• Support for several applications
  • Athena, AthenaMC (ATLAS)
  • Gaudi, DaVinci (LHCb)
  • Others …
• Support for several backends
  • LSF, PBS, SGE etc.
  • gLite WMS, Nordugrid, Condor
  • DIRAC, PANDA
• GANGA job building blocks
• Various interfaces
  • Command line, IPython, GUI
22
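The idea of a job assembled from building blocks, with backends supplied by plugins, can be sketched in a few lines. This is not the GANGA API: the `Job` class, the `register_backend` decorator and the two toy backends are all hypothetical, and "submission" only returns a string instead of contacting a batch system or grid.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Plugin registry: backend name -> function that "submits" a command line.
BACKENDS: Dict[str, Callable[[List[str]], str]] = {}

def register_backend(name):
    """Decorator that adds a submission function to the registry."""
    def decorator(func):
        BACKENDS[name] = func
        return func
    return decorator

@register_backend("local")
def submit_local(command):
    return "local: " + " ".join(command)

@register_backend("batch")
def submit_batch(command):
    return "batch queue: " + " ".join(command)

@dataclass
class Job:
    application: List[str]      # what to run
    backend: str = "local"      # where to run it

    def submit(self):
        # Look up the chosen backend plugin and hand it the application.
        return BACKENDS[self.backend](self.application)

job = Job(application=["echo", "hello"], backend="batch")
result = job.submit()
print(result)
```

Because the job only names its backend, moving the same application from a local test to another backend is a one-word change, and a new backend can be added without touching `Job`.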

GANGA USAGE (chart by experiment: ATLAS, LHCb, others) • In total 968 persons since January, 579 in ATLAS • Per month ~275 users, 150 in ATLAS 23

ATLAS Strategy
• On the EGEE and Nordugrid infrastructures, ATLAS uses direct submission to the middleware using GANGA
  • EGEE: LCG RB and gLite WMS
  • Nordugrid: ARC middleware
• On OSG: PANDA system
  • Pilot-based system
  • Also available at some EGEE sites
24

GANGA JOBS • About 50K jobs since September • Tier-1: 48% Lyon, 36% FZK 25

ATLAS PANDA System
• Interoperability is important
• PANDA jobs run on some EGEE sites
• PANDA is an additional backend for GANGA
• The positive aspect is that it gives ATLAS choices on how to evolve
26

CMS CRAB FEATURES
• CRAB: CMS Remote Analysis Builder
• User-oriented tool for grid submission and handling of analysis jobs
• Support for gLite WMS and CondorG
• Command-line oriented tool
• Lets users create and submit jobs, query their status and retrieve the output
27

CMS CRAB usage • Mid-July → mid-August 2007: 645K jobs (20K jobs/day), 89% grid success rate 28
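The three CRAB figures on this slide can be cross-checked against each other with simple arithmetic. The sketch below uses only the quoted numbers; the variable names are illustrative, and the 89% rate is applied to the full job count, which is an assumption about how it was measured.

```python
total_jobs = 645_000      # jobs from mid-July to mid-August 2007
jobs_per_day = 20_000     # quoted average daily rate
success_rate = 0.89       # quoted grid success rate

# Number of days implied by the total and the daily rate.
days = total_jobs / jobs_per_day

# Split the total into successful and failed jobs.
successful_jobs = round(total_jobs * success_rate)
failed_jobs = total_jobs - successful_jobs

print(f"period covered: about {days:.0f} days")
print(f"successful jobs: {successful_jobs:,}")
print(f"failed jobs: {failed_jobs:,}")
```

The implied period of roughly 32 days matches the mid-July to mid-August window, and an 11% failure rate still means about 71,000 jobs to resubmit or debug.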

OTHERS …
• LHCb
  • GANGA as user interface
  • DIRAC as backend
• ALICE
  • AliEn2
• AliEn and DIRAC are in many respects similar to PANDA
29

PROOF
• Diagram labels: PROOF cluster (master, scheduler, CPUs), file catalog, storage; PROOF query sent in (data file list, mySelector.C), feedback and merged final output returned
• Cluster perceived as an extension of the local PC
• Same macro and syntax as in a local session
• More dynamic use of resources
• Real-time feedback
• Automatic splitting and merging
30
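The split-and-merge pattern listed for PROOF can be illustrated with a small stand-in. This is not PROOF or ROOT code: threads play the role of worker CPUs, each "file" is just a list of numbers, and the names `split`, `process` and `run_query` are hypothetical. The selector fills a histogram with bins of width 10.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split(files, n_workers):
    """Deal the file list out round-robin into one packet per worker."""
    return [files[i::n_workers] for i in range(n_workers)]

def process(packet):
    """Stand-in selector: histogram every event value in the packet."""
    hist = Counter()
    for events in packet:
        for value in events:
            hist[value // 10] += 1
    return hist

def run_query(files, n_workers=3):
    """Master: split the work, run the selector on each packet, merge results."""
    packets = split(files, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(process, packets))
    merged = Counter()
    for hist in partial_results:
        merged.update(hist)   # Counter.update adds counts bin by bin
    return merged

files = [[1, 12, 25], [3, 14], [27, 29, 8], [11]]
merged = run_query(files)
print(sorted(merged.items()))
```

The same `process` function gives the same histogram when called directly on the whole file list, which mirrors the point that the user's code is identical in a local session and on the cluster.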

Baseline Services
The basic baseline services, from the TDR (2005):
• Storage Element
  • Castor, dCache, DPM (with SRM 1.1)
  • StoRM added in 2007
  • SRM 2.2: long delays incurred, being deployed in production
• Basic transfer tools: GridFTP, ..
• File Transfer Service (FTS)
• LCG File Catalog (LFC)
• LCG data management tools: lcg-utils
• POSIX I/O: Grid File Access Library (GFAL)
• Synchronised databases T0 ↔ T1s: 3D project
• Information System
• Compute Elements
  • Globus/Condor-C
  • Web services (CREAM)
• gLite Workload Management: in production at CERN
• VO Management System (VOMS)
• VO Boxes
• Application software installation
• Job monitoring tools
... continuing evolution: reliability, performance, functionality, requirements
31

3D - Distributed Deployment of Databases for LCG • ORACLE Streams with downstream capture (ATLAS, LHCb) • SQUID/FRONTIER web caching (CMS) 32

LHCOPN Architecture 33

The usage • The number of jobs • The production • The real success !!!! 34

Data Transfer out of Tier-0 35

Site reliability 36

Site Reliability: Tier-2 Sites • 83 Tier-2 sites being monitored 37

GRID Production per VO in one year • HEP: 33 million jobs, ~110 million Norm. CPU 38

HEP GRID Production in one year (chart labels: ILC, …, BaBar, D0) 39

CMS simulation, 2nd term 2007: ~675 Mevents (sites labelled: FNAL, CC-IN2P3, PIC) 40

ATLAS: the data chain works – Sept 2007
• Tracks recorded in the muon chambers of the ATLAS detector were dispatched to physicists all over the world, enabling simultaneous analysis at sites across the globe.
• About two million muons were recorded over two weeks.
• Terabytes of data were moved from the Tier-0 site at CERN to Tier-1 sites across Europe (seven sites), North America (one site in the United States and one in Canada) and Asia (one site in Taiwan).
• Data transfer rates reached the expected maximum.
• Real analysis (at Tier-2s) happened in quasi real time at sites across Europe and the U.S.
41

Ramp-up Needed for Start-up (charts of installed capacity, pledge, usage and target usage at Sep-06, Jul-07, Sep-07 and Apr-08; required increase factors shown: 2.3×, 2.9×, 3.7×, 3×, 3.7×) 42

The Grid is now in operation, working on: reliability, scaling up, sustainability 43

Summary
• Applications support in good shape
• WLCG service
  • Baseline services in production, with the exception of SRM 2.2
  • Continuously increasing capacity and workload
  • General site reliability is improving, but still a concern
  • Data and storage remain the weak points
• Experiment testing progressing: now involving most sites, approaching full dress rehearsals
• Sites & experiments working well together to tackle the problems
• Major Combined Computing Readiness Challenge, Feb-May 2008, before the machine starts: essential to provide experience for site operations and storage systems, stressed simultaneously by all four experiments
• Steep ramp-up ahead to deliver the capacity needed for the 2008 run
44

Improving Reliability
• Monitoring
• Metrics
• Workshops
• Data challenges
• Experience
• Systematic problem analysis
• Priority from software developers
45
