Review of 2012 Distributed Computing Operations
Stefan Roiser, 66th LHCb Week – NCB Meeting, 26 November 2012
Content
• 2012 data processing activities
• Current issues and work in progress
• Processing outlook for 2013 and beyond
• Conclusion
2012 Activities
2012 Processing Overview (all plots of this talk cover the period since 1 Jan)
[Plots: all successful jobs of 2012, CPU efficiency of successful jobs, and all successful work grouped by activity – Simulation, Prompt Processing (Reconstruction, Stripping), Reprocessing, User]
LHCb was making good use of the provided resources
Successful Work in 2012 by Site
[Plot: successful work by site – total 3.5 M CPU days]
MC Simulation
Job characteristics: no input data; output uploaded to the "closest" storage element (sketch below)
• Mostly running at Tier 2 sites
• But also at Tier 0/1 sites when resources are available
• Usually running with the lowest priority
• Low error rate, as is also true for the other production activities
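As an illustration of how such a job could be described, the following is a minimal sketch using the generic DIRAC Python API; the executable, output file and storage element names are placeholders, and LHCb's real simulation productions are defined through the LHCbDIRAC production system rather than hand-written jobs like this.

from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)

from DIRAC.Interfaces.API.Job import Job
from DIRAC.Interfaces.API.Dirac import Dirac

job = Job()
job.setName("MC_simulation_sketch")
job.setExecutable("run_gauss.sh")      # placeholder wrapper around the simulation application
job.setCPUTime(86400)                  # generous CPU limit, in seconds
# No input data is declared: simulation jobs generate their events from scratch.
# The produced file is registered and uploaded to a (placeholder) storage element.
job.setOutputData(["00012345_00000001_1.sim"], outputSE="CERN-USER")

print(Dirac().submitJob(job))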
Prompt Reconstruction
Job characteristics: runs at Tier 0/1 sites; ~3 GB of input downloaded from tape; ~5 GB of output written to tape; job duration ~36 hours
• First-pass reconstruction of detector data
• Usually 100 % of RAW files are processed; since reprocessing started, only partial (~30 %) reconstruction at CERN + "attached T2s"
Data Reprocessing
Job characteristics: same as prompt reconstruction, but also running at Tier 2 sites
• Reprocessing of 2012 data started mid September
• Pushing the system to its limits: running up to 15k reconstruction jobs was a very good stress test for post-LS1 prompt processing
• Data processing smooth
• Hitting limits with data movement, e.g. staging from tape (see Philippe's talk)
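To give a feeling for the aggregate load behind these numbers, here is a back-of-the-envelope estimate combining the per-job figures from the prompt-reconstruction slide (~3 GB in, ~5 GB out, ~36 h) with the 15k concurrent jobs quoted above; it is purely illustrative.

# Illustrative estimate of the aggregate I/O during reprocessing.
# Per-job figures are taken from the prompt-reconstruction slide;
# 15k is the peak number of concurrent reconstruction jobs quoted above.
jobs = 15000
input_gb, output_gb, duration_h = 3.0, 5.0, 36.0

# Average rate at which RAW must be delivered to the worker nodes (GB/s)
input_rate = jobs * input_gb / (duration_h * 3600)
# Average rate at which FULL.DST flows back towards tape (GB/s)
output_rate = jobs * output_gb / (duration_h * 3600)

print("sustained input  ~ %.2f GB/s" % input_rate)    # ~0.35 GB/s
print("sustained output ~ %.2f GB/s" % output_rate)   # ~0.58 GB/s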
2012 Data Reprocessing
41 Tier 2 sites were involved in the activity, downloading RAW files from T1 storage and providing ~50 % of the work for data reconstruction jobs
“Attached” Tier 2 Sites
• Page providing updated status of “attached T2 sites” and their storage at http://lhcbproject.web.cern.ch/lhcbproject/Reprocessing/sites.html
• Useful for Tier 2 sites to know from/to where they receive/provide the data for reprocessing
• Since this year we have had the possibility (and have used it) to re-attach sites to another storage element when processing power was needed (illustrated below)
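The idea can be pictured as a simple site-to-storage mapping; the sketch below is purely illustrative (the site and storage element names are examples and the re-attach helper is hypothetical, not part of LHCbDIRAC).

# Hypothetical illustration of "attached" Tier 2 sites: each T2 downloads
# RAW from, and returns results to, the storage of one Tier 1.
attachments = {
    "LCG.Manchester.uk": "RAL-RAW",     # example site / SE names only
    "LCG.Krakow.pl":     "GRIDKA-RAW",
    "LCG.CPPM.fr":       "IN2P3-RAW",
}

def reattach(site, new_se):
    """Point a Tier 2 site at a different Tier 1 storage element,
    e.g. to move processing power to where it is needed."""
    attachments[site] = new_se

reattach("LCG.Krakow.pl", "CNAF-RAW")
print(attachments)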
User Activities
Job characteristics: runs at T0/1 sites; remote input data access; duration ??
• Higher activity until the summer (ICHEP); fewer running jobs since then
• Constant “background” of failed jobs
• Fewer submissions during weekends
Issues and work in progress
Issues at Sites
• Mostly “business as usual”: pilots aborted, memory consumption by jobs, …
• 7 Nov: power cut at RAL; the site managed to recover within 24 hours
• Jobs cannot find their input data, mostly at IN2P3 and GRIDKA, i.e. sites with several “attached” T2s – overload of SRM
• Mid Oct: disk server failure at CNAF, storage out for several days; after recovery CNAF allotted double the amount of job slots in order to catch up
Queue Info
http://lhcbproject.web.cern.ch/lhcbproject/Operations/queues.html
• Page providing information on queues as seen by LHCb via the BDII
• Some sites seem to publish wrong information; as a consequence LHCb submits jobs to queues it shouldn’t, and the local batch system kills those jobs after the maximum CPU time is used
• We had to temporarily remove some sites from reprocessing because of this
• A campaign to clean up these wrong values is currently ongoing (a query sketch follows below)
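For reference, published queue limits can be inspected directly in a top-level BDII with an anonymous LDAP query against the GLUE 1.3 schema; the sketch below (using the ldap3 Python library and the CERN top-level BDII endpoint) is only an illustration, not the tool LHCb uses to build the page above.

# Query a top-level BDII for the CPU-time limits published by each CE queue.
from ldap3 import Server, Connection, ALL

server = Server("lcg-bdii.cern.ch", port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)   # anonymous bind is sufficient

conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"],
)

for entry in conn.entries:
    # GlueCEPolicyMaxCPUTime is published in minutes; suspiciously low or
    # placeholder values are candidates for the cleanup campaign.
    print(entry.GlueCEUniqueID, entry.GlueCEPolicyMaxCPUTime)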
Interaction with Sites
• Several sites inform us about major downtimes well in advance – very welcome, as it facilitates mid-term planning
• How to reach Tier 2 sites? LHCb does not have the infrastructure to interact constantly with Tier 2s
• Can we involve some (wo)men in the middle for this interaction? E.g. to pass on info on processing plans, the BDII issue, …
CVMFS Deployment
• CVMFS deployment is a high priority for LHCb (+WLCG); once we reach 100 % it will facilitate the software distribution process
• All sites supporting LHCb are highly encouraged to install CVMFS
• Currently 45 out of 96 sites have CVMFS deployed
• Status info available at https://maps.google.com/maps?q=http://cern.ch/lhcbproject/CVMFS-map/cvmfs-lhcb.kml
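A quick check a site admin might run after installation is the standard client self-test, cvmfs_config probe; the small wrapper below is just a sketch around that command (the repository name is the LHCb one, everything else is illustrative).

# Probe the LHCb CVMFS repository and report whether the client can mount it.
import subprocess

def cvmfs_ok(repository="lhcb.cern.ch"):
    """Return True if 'cvmfs_config probe' succeeds for the given repository."""
    result = subprocess.run(["cvmfs_config", "probe", repository],
                            capture_output=True, text=True)
    print(result.stdout.strip())
    return result.returncode == 0

if __name__ == "__main__":
    print("CVMFS OK" if cvmfs_ok() else "CVMFS problem - check the client setup")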
OUTLOOK
Next Processing Activities
Loads on the sites’ storage systems (summary sketch below):
• Reprocessing: Reconstruction + Stripping + Merging
  • Reconstruction also run on “attached” T2 sites
  • Staging of all RAW data from tape
  • Reconstruction output (FULL.DST) migrated to tape (via disk buffer)
  • Replication of the Merging output (DST) to multiple sites
• Incremental Stripping: Stripping + Merging
  • Staging of all FULL.DST files
  • Producing up to ~20 % additional DST files
  • Replication of DST to multiple sites
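Condensing the bullets above into one place, the sketch below summarises which storage each step touches; it is only an illustrative restatement of the slide, not a configuration used by LHCbDIRAC.

# Illustrative summary of the storage load implied by each upcoming campaign,
# condensed from the bullets above.
campaigns = {
    "Reprocessing": [
        ("stage all RAW from tape",             "tape -> disk buffer"),
        ("Reconstruction (incl. attached T2s)", "RAW -> FULL.DST"),
        ("migrate FULL.DST to tape",            "disk buffer -> tape"),
        ("Stripping + Merging",                 "FULL.DST -> DST"),
        ("replicate merged DST",                "to multiple sites"),
    ],
    "Incremental Stripping": [
        ("stage all FULL.DST",                  "tape -> disk buffer"),
        ("Stripping + Merging",                 "up to ~20% additional DST"),
        ("replicate DST",                       "to multiple sites"),
    ],
}

for name, steps in campaigns.items():
    print(name)
    for action, effect in steps:
        print("  - {:40s} {}".format(action, effect))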
Conclusions
• Very good support by sites for LHCb operations
• Very good interaction with Tier 1 sites; improvements possible for Tier 2s
• LHCb has made good use of the provided resources in 2012
• Upcoming reviews of the computing model and tools will have an impact on processes next year
• The 2012 reprocessing was a good stress test for future operations
• Changes in site infrastructures are necessary for post-LS1
BACKUP
Data Processing Workflow
[Diagram: numbered workflow showing the data processing steps (Reconstruction, Stripping, Merging) and data management steps (replication, tape migration via buffer, destruction of intermediate files) across the storage classes Tape, Tape Buffer, Disk-Only Buffer and Disk-Only Storage (D1T0, D0T1) for the data files RAW, FULL.DST, UNM.DST and PHY.DST]
FULL.DST = reconstructed physics quantities; UNM.DST = temporary output for a physics stream; PHY.DST = file ready for physics user analysis
Additional Info
• DIRAC job status plots by final minor status do not include the statuses “Input Data Resolution” and “Pending Requests” because these are not final statuses
• Yandex is not included in the CPU pie plots because it sometimes provides wrong (too high) info on running jobs and would dominate all the plots