DØ Computing Model & Monte Carlo & Data Reprocessing
Gavin Davies, Imperial College London
DOSAR Workshop, Sao Paulo, September 2005
Outline
• Operational status
  • Globally – continue to do well
  • Shared by recent Run II Computing Review
• DØ Computing model
  • Ongoing, ‘long’ established plan
• Production Computing
  • Monte Carlo
  • Reprocessing of Run II data
    • 10⁹ events reprocessed on the grid – largest HEP grid effort
• Looking forward
• Conclusions
Snapshot of Current Status
• Reconstruction keeping up with data taking
• Data handling is performing well
• Production computing is off-site and grid based; it continues to grow & work well
  • Over 75 million Monte Carlo events produced in last year
  • Run IIa data set being reprocessed on the grid – 10⁹ events
• Analysis cpu power has been expanded
• Globally doing well
  • Shared by recent Run II Computing Review
Computing Model
• Started with distributed computing, with evolution to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  • Scalable
  • Not alone – joint effort with others at FNAL and elsewhere, LHC…
• 1997 – Original plan
  • All Monte Carlo to be produced off-site
  • SAM to be used for all data handling, provides a ‘data-grid’
• Now: Monte Carlo and data reprocessing with SAM-Grid
• Next: Other production tasks, e.g. fixing, and then user analysis
• Use concept of Regional Centres
  • DOSAR one of the pioneers
  • Builds local expertise
Reconstruction Release
• Periodically update the version of the reconstruction code
  • As new / more refined algorithms are developed
  • As understanding of the detector improves
  • Frequency of releases decreases with time
• One major release in the last year – p17
  • Basis for current Monte Carlo (MC) & data reprocessing
• Benefits of p17
  • Reco speed-up
  • Full calorimeter calibration
  • Fuller description of detector material
  • Use of zero-bias overlay for MC
• More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp
Data Handling - SAM
• SAM continues to perform well, providing a data-grid
  • 50 SAM sites worldwide
  • Over 2.5 PB (50B events) consumed in the last year
  • Up to 300 TB moved per month
  • Larger SAM cache solved tape access issues
• Continued success of SAM shifters
  • Often remote collaborators
  • Form 1st line of defense
• SAMTV monitors SAM & SAM stations
• http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
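A quick back-of-envelope check of the data-handling figures above, as a minimal sketch using only the numbers quoted on this slide: 2.5 PB over 50 billion events corresponds to roughly 50 kB per consumed event, and 300 TB in a month is an average rate of order 100 MB/s.

```python
# Back-of-envelope check of the SAM figures quoted above.
# Only the numbers from the slide are used; nothing here is measured.

consumed_bytes = 2.5e15          # "Over 2.5 PB consumed in the last year"
consumed_events = 50e9           # "(50B events)"
monthly_transfer_bytes = 300e12  # "Up to 300 TB moved per month"

avg_event_size_kb = consumed_bytes / consumed_events / 1e3
sustained_rate_mb_s = monthly_transfer_bytes / (30 * 24 * 3600) / 1e6

print(f"average consumed event size ~ {avg_event_size_kb:.0f} kB")      # ~50 kB
print(f"peak-month transfer rate    ~ {sustained_rate_mb_s:.0f} MB/s")  # ~116 MB/s
```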
SAMGrid
• SAM – data handling
• JIM – job submission & monitoring
• SAM + JIM → SAM-Grid
• More than 10 DØ execution sites
• http://samgrid.fnal.gov:8080/
• http://samgrid.fnal.gov:8080/list_of_schedulers.php
• http://samgrid.fnal.gov:8080/list_of_resources.php
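To make the SAM/JIM split concrete, the following sketch shows the kind of information a submission along these lines has to carry: what to run, which SAM dataset supplies the input, and which scheduler/site executes it. This is an illustrative Python snippet with made-up keys, not the actual SAM-Grid job-description syntax.

```python
# Illustrative only: the kind of information a SAM-Grid submission carries.
# These keys and values are assumptions, NOT the real JIM job-description format.

job_description = {
    "job_type": "monte_carlo",                    # or "reprocessing", etc.
    "input_dataset": "example_p17_dataset",       # resolved and delivered by SAM
    "application": "d0reco",                      # executable / release to run
    "events_per_batch_job": 25_000,               # fan-out granularity
    "scheduler": "samgrid.fnal.gov",              # one of the listed schedulers
}

def submit(desc):
    """Stand-in for JIM submission; the grid job is later fanned out into batch jobs."""
    print(f"submitting {desc['job_type']} job against dataset {desc['input_dataset']}")

submit(job_description)
```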
Remote Production Activities – Monte Carlo - I
• Over 75M events produced in the last year, at more than 10 sites
  • More than double last year’s production
  • Vast majority on shared sites
  • DOSAR a major part of this
• SAM-Grid introduced in spring 04, becoming the default
  • Based on the request system and jobmanager-mc_runjob
  • MC software package retrieved via SAM – same way, inc. central farm
• Average production efficiency ~90%
  • Average inefficiency due to grid infrastructure ~1-5%
  • http://www-d0.fnal.gov/computing/grid/deployment-issues.html
• Continued move to common tools
  • DOSAR sites continue move to SAMGrid from McFarm
Remote Production Activities – Monte Carlo - II
• Beyond just ‘shared’ resources
  • More than 17M events produced ‘directly’ on LCG via submission from Nikhef
  • Good example of remote site driving the ‘development’
• Similar momentum building on/for OSG
  • Two good site examples within p17 reprocessing
Remote Production Activities – Reprocessing - I
• After significant improvements to reconstruction, reprocess old data
• p14: Winter 2003/04
  • 500M events, 100M remotely, from DST
  • Based around mc_runjob
  • Distributed computing rather than Grid
• p17: end of March to ~October
  • ×10 larger, i.e. 1000M events, 250 TB
  • Basically all remote
  • From raw data, i.e. use of db proxy servers
  • SAM-Grid as default (using mc_runjob)
  • 3200 1 GHz PIIIs for 6 months
• Massive activity – largest grid activity in HEP
• http://www-d0.fnal.gov/computing/reprocessing/p17/
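A rough consistency check of the p17 numbers quoted above, as a sketch using only the figures on this slide (and approximating 6 months as 180 days): 3200 1 GHz PIIIs for 6 months spread over 10⁹ events corresponds to roughly 50 CPU-seconds per event, and 250 TB over the same events is about 250 kB of data volume per event.

```python
# Rough consistency check of the p17 reprocessing figures quoted above.
# Only numbers from the slide are used; 6 months is approximated as 180 days.

cpus = 3200            # "3200 1GHz PIIIs"
months = 6
events = 1.0e9         # "1000M events"
data_volume_tb = 250   # "250TB"

wall_seconds = months * 30 * 24 * 3600       # ~1.6e7 s
cpu_seconds_available = cpus * wall_seconds  # ~5e10 CPU-s

print(f"~{cpu_seconds_available / events:.0f} CPU-s per event (1 GHz PIII)")       # ~50 s
print(f"~{data_volume_tb * 1e12 / events / 1e3:.0f} kB of data volume per event")  # ~250 kB
```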
Reprocessing - II
• Each grid job spawns many batch jobs
• Workflow has two stages: “Production” and “Merging”
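A minimal sketch of that fan-out: one grid-level job is expanded into many batch jobs, each reconstructing a chunk of files (“Production”), and their outputs are then combined (“Merging”). All names below are illustrative; this is not the SAM-Grid or mc_runjob implementation.

```python
# Minimal sketch of the production -> merging fan-out described above.
# Purely illustrative: none of these names correspond to real SAM-Grid code.

def make_batch_jobs(input_files, files_per_job=10):
    """Split one grid job's input file list into many batch-job file lists."""
    return [input_files[i:i + files_per_job]
            for i in range(0, len(input_files), files_per_job)]

def run_production(batch_files):
    """Stand-in for reconstructing one batch job's files; returns output names."""
    return [f.replace("raw", "tmb") for f in batch_files]

def merge(outputs, merged_name):
    """Stand-in for the merging step that combines many small outputs."""
    print(f"merging {len(outputs)} files into {merged_name}")

if __name__ == "__main__":
    grid_job_inputs = [f"raw_{i:04d}.evpack" for i in range(100)]
    batch_jobs = make_batch_jobs(grid_job_inputs)          # 1 grid job -> 10 batch jobs
    produced = [out for job in batch_jobs for out in run_production(job)]
    merge(produced, "merged_tmb_0001.evpack")
```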
Reprocessing - III
• SAMGrid provides
  • Common environment & operation scripts at each site
  • Effective book-keeping
    • SAM avoids data duplication + defines recovery jobs
    • JIM’s XML-DB used to ease bug tracing
• Tough deploying a product still under evolution, with limited manpower, to new sites (we are a running experiment)
  • Very significant improvements in JIM (scalability) during this period
• Certification of sites – need to check:
  • SAMGrid vs usual production
  • Remote sites vs central site
  • Merged vs unmerged files
  • FNAL vs SPRACE
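One way to picture the certification checks listed above, as a sketch under assumed names rather than the actual DØ certification procedure: summarise simple per-event quantities in a reference sample (e.g. processed at FNAL, or unmerged) and a test sample (e.g. processed at SPRACE, or merged), then require agreement within a tight tolerance.

```python
# Illustrative certification check: compare a reference sample (e.g. FNAL /
# unmerged) with a test sample (e.g. SPRACE / merged).
# Field names and tolerances are assumptions, not the real DØ procedure.

def summarize(events):
    """Reduce a sample to a few per-event averages used for the comparison."""
    n = len(events)
    return {
        "n_events": n,
        "mean_tracks": sum(e["n_tracks"] for e in events) / n,
        "mean_jets": sum(e["n_jets"] for e in events) / n,
    }

def certify(reference, test, rel_tol=1e-3):
    """Return True if the test sample agrees with the reference within rel_tol."""
    ref, tst = summarize(reference), summarize(test)
    if ref["n_events"] != tst["n_events"]:
        return False
    return all(abs(ref[k] - tst[k]) <= rel_tol * abs(ref[k])
               for k in ("mean_tracks", "mean_jets"))

fnal_sample = [{"n_tracks": 12, "n_jets": 3}] * 1000
sprace_sample = [{"n_tracks": 12, "n_jets": 3}] * 1000
print(certify(fnal_sample, sprace_sample))   # True: samples agree
```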
Reprocessing - IV
• Monitoring (illustration)
  • Production speed (in M events / day) and number of batch jobs completing successfully
  • Overall efficiency, speed, or by site
  • http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
• Status – into the “end-game”
  • ~855 Mevents done
  • Data sets all allocated, moving to ‘cleaning-up’
  • Must now push on the Monte Carlo
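A toy version of the two monitoring quantities above (production speed in M events/day and the fraction of batch jobs completing successfully); the job records and field names are invented for illustration, not taken from the real monitoring pages.

```python
# Toy calculation of the two monitoring quantities quoted above.
# The job records and field names are illustrative only.

def daily_summary(job_records):
    """job_records: list of dicts like {"site": ..., "ok": bool, "events": int}."""
    done = [j for j in job_records if j["ok"]]
    return {
        "speed_mevents_per_day": sum(j["events"] for j in done) / 1e6,
        "job_success_fraction": len(done) / len(job_records) if job_records else 0.0,
    }

if __name__ == "__main__":
    records = [{"site": "SPRACE", "ok": True, "events": 250_000}] * 36 \
            + [{"site": "SPRACE", "ok": False, "events": 0}] * 4
    print(daily_summary(records))   # ~9 Mevents/day, 90% job success
```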
SAM-Grid Interoperability
• Need access to greater resources as data sets grow
• Ongoing programme on LCG and OSG interoperability
• Step 1 (co-existence) – use shared resources with a SAM-Grid head-node
  • Widely done for both reprocessing and MC
  • OSG co-existence shown for data reprocessing
• Step 2 – SAMGrid-LCG interface (sketched below)
  • SAM does data handling & JIM job submission
  • Basically a forwarding mechanism
  • Prototype established at IN2P3/Wuppertal
  • Extending to production level
• OSG activity increasing – build on LCG experience
• Team work between core developers / sites
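A highly simplified sketch of the Step 2 “forwarding” idea as described above: a job arrives at a forwarding node, SAM remains responsible for delivering and registering the data, while execution is pushed on to an LCG (or OSG) resource. Everything below is hypothetical Python, not the real SAMGrid-LCG interface; all class, method, and dataset names are invented.

```python
# Highly simplified sketch of the Step-2 forwarding idea described above.
# All classes and methods are hypothetical; this is not the real SAMGrid-LCG code.

class SamDataHandler:
    """Stand-in for SAM: stages input files and records output locations."""
    def stage_input(self, dataset):
        return [f"/cache/{dataset}/file_{i}.raw" for i in range(3)]
    def register_output(self, files):
        print(f"registering {len(files)} output files in SAM")

class LcgBackend:
    """Stand-in for an LCG computing element reached via the forwarding node."""
    def submit(self, executable, input_files):
        print(f"submitting {executable} with {len(input_files)} inputs to LCG")
        return [f.replace(".raw", ".tmb") for f in input_files]

def forward_job(dataset, executable):
    """JIM-style submission forwarded to a non-SAM-Grid resource."""
    sam = SamDataHandler()
    backend = LcgBackend()
    inputs = sam.stage_input(dataset)             # SAM keeps doing the data handling
    outputs = backend.submit(executable, inputs)  # execution forwarded to LCG
    sam.register_output(outputs)                  # results flow back through SAM

forward_job("example_reprocessing_dataset", "d0reco")
```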
Looking Forward
• Increased data sets require increased resources for MC, reprocessing, etc.
• Route to these is increased use of the grid and common tools
  • Have an ongoing joint programme, but work to do…
• Continue development of SAM-Grid
  • Automated production job submission by shifters
• Deployment team
  • Bring in new sites in a manpower-efficient manner
  • ‘Benefit’ of a new site goes well beyond a ‘cpu’ count – we appreciate / value this
• Full interoperability
  • Ability to access all shared resources efficiently
• Additional resources for the above recommended by the Taskforce
Conclusions
• Computing model continues to be successful
  • Based around grid-like computing, using common tools
  • Key part of this is the production computing – MC and reprocessing
• Significant advances this year:
  • Continued migration to common tools
  • Progress on interoperability, both LCG and OSG
    • Two reprocessing sites operating under OSG
  • p17 reprocessing – a tremendous success
    • Strongly praised by Review Committee
• DOSAR major part of this
  • More ‘general’ contribution also strongly acknowledged
• Thank you
• Let’s all keep up the good work
Back-up
Terms
• Tevatron
  • Approx. equivalent challenge to LHC in “today’s” money
  • Running experiments
• SAM (Sequential Access to Metadata)
  • Well developed metadata and distributed data replication system
  • Originally developed by DØ & FNAL-CD
• JIM (Job Information and Monitoring)
  • Handles job submission and monitoring (all but data handling)
  • SAM + JIM → SAM-Grid – computational grid
• Tools
  • Runjob – handles job workflow management
  • dØtools – user interface for job submission
  • dØrte – specification of runtime needs
Reminder of Data Flow
• Data acquisition (raw data in evpack format)
  • Currently limited to a 50 Hz Level-3 accept rate
  • Request increase to 100 Hz, as planned for Run IIb – see later
• Reconstruction (tmb/DST in evpack format)
  • Additional information in tmb → tmb++ (DST format stopped)
  • Sufficient for ‘complex’ corrections, inc track fitting
• Fixing (tmb in evpack format)
  • Improvements / corrections coming after the cut of a production release
  • Centrally performed
• Skimming (tmb in evpack format)
  • Centralised event streaming based on reconstructed physics objects – see the sketch below
  • Selection procedures regularly improved
• Analysis (out: root histograms)
  • Common root-based Analysis Format (CAF) introduced in the last year
  • tmb format remains
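The skimming step above amounts to routing each reconstructed event into one or more named streams according to its physics objects. A minimal sketch of that idea follows; the skim names and selection cuts are invented for illustration and are not the real DØ streams.

```python
# Minimal sketch of centralised event streaming ("skimming") as described above.
# Skim names and selection cuts are invented for illustration only.

SKIM_DEFINITIONS = {
    "2MU":   lambda ev: ev["n_muons"] >= 2,
    "EMJET": lambda ev: ev["n_electrons"] >= 1 and ev["n_jets"] >= 2,
    "JET":   lambda ev: ev["leading_jet_pt"] > 40.0,   # GeV, illustrative threshold
}

def stream_event(event):
    """Return the list of skims this reconstructed event belongs to."""
    return [name for name, selects in SKIM_DEFINITIONS.items() if selects(event)]

event = {"n_muons": 2, "n_electrons": 0, "n_jets": 1, "leading_jet_pt": 55.0}
print(stream_event(event))   # ['2MU', 'JET']
```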
Remote Production Activities – Monte Carlo
The Good and Bad of the Grid
• Only viable way to go…
  • Increase in resources (cpu and potentially manpower)
  • Work with, not against, LHC
• Still limited BUT
  • Need to conform to standards – dependence on others…
  • Long term solutions must be favoured over short term idiosyncratic convenience
    • Or won’t be able to maintain adequate resources
  • Must maintain production level service (papers), while increasing functionality
  • As transparent as possible to non-expert