WLCG Overview Board CERN, 9th March 2012 WLCG Status Report Ian Bird
WLCG MoU Status • Since last summer: • US LLNL granted full member access as an ALICE Tier2; initially based on a Letter of Intent. Progress on the MoU signature to be reported at the next meeting • Former French Tier3 LPSC Grenoble became an ALICE and ATLAS Tier2 • Informal discussions and exchange of information with 3 new countries expressing interest in becoming WLCG Tier2s (Thailand, Cyprus, Slovakia) • The WLCG MoU with EGI has now been signed • Confirms the collaboration Sue.Foffano@cern.ch
WLCG: Data in 2011 • Castor service at Tier 0 well adapted to the load: • Heavy Ions: more than 6 GB/s to tape (tests show that Castor can easily support >12 GB/s); the actual limit now is the network from the experiment to the CC • Major improvements in tape efficiencies – tape writing at ~native drive speeds; fewer drives needed • Chart annotations: HI – ALICE data into Castor >4 GB/s; HI – overall rates to tape >6 GB/s; 22 PB of data written in 2011
Note on Castor tape speed improvements • January: new Castor tape software • Fully buffered tape marks in production • 300% tape speed increase • Many other improvements in tape handling this year • E.g. public instance (least efficient): 14 MB/s → 39 MB/s (2/3 buffered tape marks) → 49 MB/s → 146 MB/s (only buffered tape marks) • CMS instance, which was better at 61 MB/s, is now at 157 MB/s • The current write speed (150–160 MB/s) is still below the native drive speed (240 MB/s) • The current bottleneck is the speed of the disks on the disk servers. Data is pre-loaded into RAM: small files fit into RAM and hence are more efficient! • We plan to reduce the number of drives by 30 • Other Castor sites will benefit from this; other tape systems may consider this technique…
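As a rough illustration of why buffered tape marks matter, the toy model below assumes a fixed per-file synchronisation penalty; the file size and penalty values are assumptions for illustration, not measured Castor numbers.

```python
# Illustrative model of effective tape write speed with and without
# buffered tape marks. All numbers are assumptions, not Castor measurements.

def effective_speed_mb_s(file_size_mb, drive_speed_mb_s, sync_penalty_s):
    """Effective throughput when each file costs a fixed per-file sync penalty."""
    transfer_time = file_size_mb / drive_speed_mb_s   # time spent streaming data
    return file_size_mb / (transfer_time + sync_penalty_s)

drive_speed = 240.0    # native drive speed, MB/s
small_file = 200.0     # typical small file, MB (assumed)

# Synchronous tape mark flushed after every file: assume ~3 s penalty per file.
print(effective_speed_mb_s(small_file, drive_speed, sync_penalty_s=3.0))   # ~52 MB/s

# Buffered tape marks: the per-file flush largely disappears (assume ~0.1 s left).
print(effective_speed_mb_s(small_file, drive_speed, sync_penalty_s=0.1))   # ~215 MB/s
```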
WLCG in 2011 • 1.5M jobs/day • 10⁹ HEPSPEC-hours/month (~150k CPU in continuous use) • Usage continues to grow • Charts: # jobs/day; CPU usage
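A quick consistency check of the quoted figure (a sketch; the HS06-per-core value is an assumed typical number for 2011-era hardware, not a WLCG-published one):

```python
# Rough check that 1e9 HEPSPEC-hours/month corresponds to ~150k CPUs in continuous use.
hs06_hours_per_month = 1e9
hours_per_month = 730.0        # ~365 * 24 / 12
hs06_per_core = 9.0            # assumed average for 2011-era hardware

continuous_hs06 = hs06_hours_per_month / hours_per_month    # ~1.37e6 HS06 running continuously
cores = continuous_hs06 / hs06_per_core                     # ~150,000 cores
print(f"{cores:,.0f} cores in continuous use")
```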
Use of Tier 0+1 • CPU and disk occupancy compared to the total available for Tier 0 and Tier 1s (NB "available" includes an efficiency factor) • Efficiencies now good; ALICE problem now resolved • Chart legend: green line = available × efficiency; pink line = pledges Ian.Bird@cern.ch
Tiers usage vs pledges Ian.Bird@cern.ch
Service incidents • Fewer incidents in general • But longer lasting (or more difficult to resolve) • In Q4 2011 all except one took >24 hr to resolve • Chart: time to resolution Ian.Bird@cern.ch
Experiment computing progress Ian.Bird@cern.ch
WLCG – no shutdown for computing Activity on 3rd Jan
ATLAS computing in 2011 incl. Christmas break, and now • Smooth running for p-p and HI in 2011 • Average 340 Hz p-p, 4.6e6 live seconds, 1.6e9 events • Simulated events: 2.8e9 (full) + 0.7e9 (fast) • Compression of RAW events since July (1.2 → 0.64 MB/ev); the compressed copy kept on disk • Dynamically extended Tier0 capacity into the public share where needed, esp. for HI running – used up to 5000 cores (3000 is nominal) • One reprocessing per year • Steady improvement of reconstruction software speed to cope with high pileup • Usage over the Christmas break • HI backlog processing at Tier0 until mid-December • Production of improved 7 TeV MC during the break for conferences, using all available capacity including spillover of Grid jobs into Tier0 (left with ~500 nodes for residual HI processing) • Spillover to be refined for the long 2013-14 shutdown to include the HLT • Need massive 8 TeV MC production in 2012
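For reference, the quoted event count and the RAW compression gain follow directly from the average rate, live time and event sizes above (a minimal arithmetic sketch, no assumptions beyond the slide's own numbers):

```python
# Consistency check of the 2011 ATLAS p-p numbers quoted above.
rate_hz = 340.0              # average physics stream rate
live_seconds = 4.6e6         # p-p live time
events = rate_hz * live_seconds
print(f"{events:.2e} events")                                # ~1.6e9, as quoted

# RAW compression since July: 1.2 MB/event -> 0.64 MB/event
raw_before, raw_after = 1.2, 0.64
print(f"RAW size reduced by {100 * (1 - raw_after / raw_before):.0f}%")   # ~47%
```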
All-Tiers and Tier-0 concurrent jobs: timeline (Dec–Jan) • All Tiers: ~100k concurrent jobs; periods of MC production and Heavy Ions processing • Tier-0: peak ~5000 jobs (nominal 3000); T0 usage for MC
2012 expectation • 400 Hz rate in physics streams • Expect LHC to run with β* = 0.6 m, average 24 interactions per event (34 at the beginning of fills) • ~1.6e9 events as in 2011, assuming shorter running (21 weeks), with a ratio of stable beams to total physics time of 0.3 • 15 Hz of Zero Bias events for pileup overlay studies and MC • Plan 75 Hz rate in "delayed" streams (to be processed in 2013) • ~250e6 events • Strong physics case for B-physics • RAW written to tape ~200 TB × 2 copies, processed to DAODs in 2013, on disk ~100 TB × 2 copies • HLT, Tier0 and Grid processing making full use of improvements in the latest software release for CPU and AOD size • High bunch-charge runs of 2011 had the pileup expected for 2012 and were used in tuning
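A rough sizing sketch for the delayed streams, derived only from the figures quoted above (21 weeks, stable-beam fraction 0.3, 75 Hz, ~200 TB RAW); the per-event size at the end is the implied value, not an ATLAS-quoted one:

```python
# Rough sizing of the 2012 "delayed" streams from the numbers quoted above.
weeks = 21
stable_beam_fraction = 0.3
live_seconds = weeks * 7 * 86400 * stable_beam_fraction      # ~3.8e6 s of stable beams

delayed_rate_hz = 75
delayed_events = delayed_rate_hz * live_seconds
print(f"{delayed_events:.2e} delayed events")                # ~2.9e8, i.e. ~250e6 as quoted

# ~200 TB of RAW for these events implies roughly 0.7 MB per event
raw_tb = 200.0
print(f"{raw_tb * 1e6 / delayed_events:.2f} MB/event")       # ~0.70 MB/event
```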
Activities Over the Winter • CMS has done a complete processing of the 2011 data and MC with the latest CMSSW version • 8 TeV MC production began immediately after the HI run • Tier-0 resources were successfully used for MadGraph LHE production; 4B events were produced over the shutdown • Analysis activity continued at a high level in preparation for conferences.
CMS Tier-0 Data • 2012 will have higher luminosity and more complex events • Excellent performance of CMSSW_5_2 for high-PU events • Faster and uses less memory • CMS should be able to promptly reconstruct 300 Hz • Even with the improvements CMS will eat more into the time between fills • We will also make more use of the CAF and lxbatch resources for Tier-0 reconstruction
Data Parking • Given the challenging triggering environment, potentially interesting physics, and the impending long shutdown, CMS would like to take more data than we have resources to reconstruct at the Tier-0 • We will repack additional data into RAW and "park" it at Tier-1s for reconstruction at a later time • How long the data stays parked will depend on available Tier-1 resources • Some data can be reconstructed during the year, and some may safely wait
Impact of Data Parking on Processing • Data parking scenarios roughly double the dataset taken in 2012 • About half promptly reconstructed and half reconstructed later at Tier-1s • We believe that in 2013 we need 20% more T1 CPU than in 2012 and 15% more T2 CPU than we estimated last year • Beyond the small changes presented in 2011, no further increases in tape and disk storage are expected from the planning • Primarily this is due to changes in what CMS stores and analyzes • Write out fewer MC RAW events, as they are not needed and new MC can be recreated from smaller formats • Analysis has moved more completely to AOD formats, which saves space • Aggressive clean-up campaigns of old MC and old processing passes also free resources for new things
Impact of Data Parking on Analysis • Analysis Resources are well used • Additional data will have some impact in increasing analysis needs • Further constraining resources • Need to ensure that high priority activities can complete even in the presence of additional load • Stretch out lower priority activities • Expect to use the glide-in WMS global queue functionality to enforce priority
LHCb • Full 1 fb⁻¹ of 2011 data was re-processed by the end of November • MC production for the 2011 configuration • Chart – T2 usage: MC 63%, data reprocessing 28% Ian.Bird@cern.ch
LHCb • Continuing disk shortage due to larger event size and increased trigger rate • Reduced the number of copies of data for older processing passes • Situation should improve with the installation of the 2012 pledges • Also commissioning centralised production for physics groups, to reduce the need for each group to run large sets of jobs • Analysis activity steady at Tier 0 and Tier 1s • 1000 concurrent user jobs & 16000 jobs/day • Starting to prepare the online farm for offline use Ian.Bird@cern.ch
LHCb event rate • During 2011, a number of changes were introduced in the LHCb computing model to bridge the gap between the extended physics reach of LHCb and the available pledges, which were defined before this extension. Already in 2010, and more clearly in 2011, LHCb decided to expand its physics reach beyond the original vision to include significantly more charm physics. In particular, the recent observation of possible evidence of CP violation in charmed meson decays pushed the Collaboration, at the end of 2011, into a campaign of optimisation of the HLT filter that will increase the yield of collected charm events. A general increase in signal events is also expected from the operation of the LHC at higher energy and from exploiting the full bandwidth of the LHCb L0 trigger (1 MHz). • As a result, the trigger rate, already increased to 3 kHz in 2011 (from an original 2 kHz), will reach 4.5 kHz in 2012. The foreseen 2012 pledges will not allow the physics potential of the new trigger bandwidth to be fully exploited, due to disk space limitations. Therefore, unless extra resources become available during the year, part of the recorded data will have to be "locked" during 2012: the stripping will be tuned to produce the same data bandwidth as in 2011. This data will be "unlocked" in 2013 re-stripping passes by introducing additional channels and looser requirements, increasing the final output by 50%. This will allow both enhanced analyses and true data-mining searches. Ian.Bird@cern.ch
ALICE: Computing model parameters • Parameters have been updated based on the 2011 exercise (average values over the entire pp and PbPb runs) • Running scenario 2012: • pp: 145 days (3.3×10⁶ s effective), Lint = 4 pb⁻¹ delivered, 1.4×10⁹ events (MB + rare) • pPb: 24 days (5.2×10⁵ s effective), Lint = 15–30 nb⁻¹ delivered, 3×10⁸ events (MB + rare)
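For orientation, the implied average event rates from the running scenario above (a simple check using only the quoted numbers):

```python
# Implied average event rates from the ALICE 2012 running scenario above.
pp_events, pp_eff_seconds = 1.4e9, 3.3e6
ppb_events, ppb_eff_seconds = 3e8, 5.2e5

print(f"pp : {pp_events / pp_eff_seconds:.0f} Hz average")    # ~420 Hz
print(f"pPb: {ppb_events / ppb_eff_seconds:.0f} Hz average")  # ~580 Hz
```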
2012 requirements and 2013 requirements (resource tables)
Offline status • Very good use of "parasitic resources" • Full reconstruction of 2011 pp data; PbPb 2011 reconstructed pass 2 (ready for physics conferences!) • Presently doing analysis, MC, cosmic & calibration – waiting for beam… • Resources still tight, but… • New "prototype" T1s for ALICE in South Korea (KISTI) and Mexico (UNAM); InGrid2012 in Mumbai will also discuss an Indian T1 • Storage is critical – a "cleanup campaign" will help in the short term • Good efficiency of production jobs (>80%)
PbPb data taking and services • Very successful data taking period: 140 million events, enriched with rare triggers • HLT compression operational, ×3 reduction of the RAW data volume • Reconstruction and fast QA of ~50% of the data during the data taking period • Excellent performance of data services: CASTOR2@CERN supported unprecedented data transfer rates (up to 4 GB/sec); steady performance of the tape storage at T1s • Data replication completed 4 days after the end of the period, at an average rate of 300 MB/sec • Charts: data accumulation (peak rate up to 4 GB/sec); high-rate throughput test
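As an order-of-magnitude check, 4 days of replication at an average of 300 MB/sec corresponds to roughly 100 TB moved to the Tier 1s (a sketch; the slide does not quote the total compressed RAW volume, so this is only indicative):

```python
# Order-of-magnitude check of the PbPb replication figures quoted above.
replication_days = 4
average_rate_mb_s = 300.0
replicated_tb = replication_days * 86400 * average_rate_mb_s / 1e6
print(f"~{replicated_tb:.0f} TB replicated to Tier 1s")       # ~104 TB
```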
Pledges vs requirements All experiments will be pushing the limits of the resources that they will have available Ian.Bird@cern.ch
Pledge installation Fears of late availability due to Thailand floods and consequent disk shortages have not materialized; little impact – except at CERN Ian.Bird@cern.ch
Status of Remote Tier 0 • Tendering process has finished • Paper for Finance Committee – next week • hopefully approval next week • Intent: • Start prototyping later this year or in 2013 • Production in 2014 Ian.Bird@cern.ch
Clouds … • The WLCG strategy in this area is a topic covered by the TEGs, and there will be ongoing work on the use of virtualisation and on how to use clouds • Independently, CERN is involved in the Helix Nebula project • EIROforum labs (CERN, EMBL, ESA, others observing) act as user communities, together with industrial partners as resource providers • Goals: • Some relevant to WLCG: understand the costs, policy issues, and practicality of moving, storing and accessing data at a cloud provider • Other goals: specifically, to address some of the data privacy issues that prevent European labs from moving services or workloads to clouds Ian.Bird@cern.ch
Helix Nebula: Rationale • The EIROforum labs collaborated on a cloud strategy paper: • “The potential benefits of adopting a cloud‐based approach to the provision of computing resources have long been recognised in the business world and are readily transferable to the scientific domain. In particular, the ability to share scarce resources and absorb bursts of intensive demand can have a far‐reaching effect on the economics of research infrastructures.” • “However, Europe has failed to respond to the opportunity presented by the cloud and has not yet adopted a leadership role.”
Role of Helix Nebula: The Science Cloud Vision of a unified cloud-based infrastructure for the ERA based on Public/Private Partnership, with 4 goals building on the collective experience of all involved. • Goal One: Establish HELIX NEBULA – the Science Cloud – as a cloud computing infrastructure addressing the needs of the ERA and capable of serving as a platform for innovation and evolution of the overall e-infrastructure. • Goal Two: Identify and adopt suitable policies for trust, security and privacy on a European level • Goal Three: Create a light-weight governance structure that involves all the stakeholders – and which can evolve over time as the infrastructure, services and user base grow. • Goal Four: Define a funding scheme involving all the stakeholder groups (service suppliers, users, EC and national funding agencies) for a PPP to implement a Cloud Computing Infrastructure that delivers a sustainable and profitable business environment adhering to European-level policies.
Specific outcomes • Develop strategies for extremely large or highly distributed and heterogeneous scientific data (including service architectures, applications and standardisation) in order to manage the upcoming data deluge • Analyse and promote trust building towards open scientific data e-Infrastructures covering organisational, operational, legal and technological aspects, including authentication, authorisation and accounting (AAA) • Develop strategies and establish structures aiming at co-ordination between e-infrastructure operators • Create frameworks, including business models for supporting Open Science and cloud infrastructures based on PPP, useful for procurement of computing services suitable for e-Science
Scientific Flagships • CERN LHC (ATLAS): • High Throughput Computing and large scale data movement • EMBL: • Novel de novo genomic assembly techniques • ESA: • Integrated access to data held in existing Earth Observation “Super Sites” • Each flagship brings out very different features and requirements and exercises different aspects of a cloud offering
ATLAS use case • Simulations (~no input) with stage out to: • Traditional grid storage vs • Long term cloud storage • Data processing (== “Tier 1”) • This implies large scale data import and export to/from the cloud resource • Distributed analysis (== “Tier 2”) • Data accessed remotely (located at grid sites), or • Data located at the cloud resource (or another?) • Bursting for urgent tasks • Centrally managed: urgent processing • Regionally managed: urgent local analysis needs • All experiences immediately transferable to other LHC (& HEP) experiments
Immediate and longer term goals • Determine the costs of commercial cloud resources from various sources • Compute resources • Network transfers into and out of the cloud • Short and long term data storage in the cloud • Develop an understanding of appropriate SLAs • How can they be broadly applicable to LHC or HEP? • Understand policy and legal constraints, e.g. in moving scientific data to commercial resources • Performance and reliability – compared to the WLCG baseline • Use of standards (interfaces, etc.) & interoperability between providers • Can CERN transparently offload work to a cloud resource? • Which type of work makes sense? • Long term: can we use commercial services for a significant fraction of the overall resources available to CERN experiments? • At which point is it economic/practical to rely on 3rd party providers?
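One way to make the cost question concrete is a simple per-workload cost model; everything in the sketch below (prices, CPU hours, data volumes) is a hypothetical placeholder, not a Helix Nebula or CERN figure:

```python
# Hypothetical cost-model sketch for comparing a cloud provider against
# in-house capacity. All prices and volumes are invented placeholders.

def cloud_cost(cpu_hours, data_in_tb, data_out_tb, storage_tb_months,
               price_cpu_hour=0.05, price_egress_per_tb=50.0,
               price_storage_tb_month=25.0, price_ingress_per_tb=0.0):
    """Total cost of running a workload on a commercial cloud (assumed unit prices)."""
    return (cpu_hours * price_cpu_hour
            + data_in_tb * price_ingress_per_tb
            + data_out_tb * price_egress_per_tb
            + storage_tb_months * price_storage_tb_month)

# Example: a simulation campaign with little input and modest output (placeholder numbers).
print(cloud_cost(cpu_hours=1e6, data_in_tb=1, data_out_tb=50, storage_tb_months=100))
```

Plugging real provider quotes into the price parameters and WLCG accounting numbers into the workload would give a first-order comparison against the equivalent in-house cost.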
What happens after EMI, EGI … • OSG will have ongoing funding • But at ~20% less than previously • EMI finishes in just over 1 year • EGI support for Heavy User Communities finishes in just over 1 year • But EGI-InSPIRE continues for a further year • We need to consider the sustainability of what WLCG requires • WLCG TEGs (see later talk) – produce a technical strategy • Critical middleware – negotiate support directly with institutes and projects • Improve WLCG technical collaboration – following discussion in the GDB and as a consequence of the TEGs Ian.Bird@cern.ch
Conclusions • Grid operations have continued smoothly over 2011 and the holiday period, with no major issues • Experiments are making good progress in data processing and analysis • Tier 1 and Tier 2 resources are already well utilized, and likely to be stretched in 2012 • 3 experiments propose to take additional data in 2012 for processing later • WLCG needs to ensure the tools and support it needs remain available after EMI, and as EGI enters a new phase • Several initiatives starting Ian.Bird@cern.ch