The ATLAS Computing Model Overview

Presentation Transcript


  1. The ATLAS Computing Model Overview
  Shawn McKee, University of Michigan
  Tier-3 Network Meeting at Michigan, March 30th, 2007

  2. Overview
  • The ATLAS collaboration has less than a year before it must manage large amounts of “real” data for its globally distributed collaboration.
  • ATLAS physicists need the software and physical infrastructure required to:
    • Calibrate and align detector subsystems to produce well-understood data
    • Realistically simulate the ATLAS detector and its underlying physics
    • Provide access to ATLAS data globally
    • Define, manage, search and analyze datasets of interest
  • I will cover the current status and plans for the ATLAS computing model and touch on the implications for Tier-3 centers and users.

  3. The ATLAS Computing Model
  • The Computing Model is fairly well evolved and is documented in the Computing TDR:
    • http://doc.cern.ch//archive/electronic/cern/preprints/lhcc/public/lhcc-2005-022.pdf
  • There are many areas with significant questions/issues to be resolved:
    • The calibration and alignment strategy is still evolving
    • Physics data access patterns have only been partially exercised (SC04)
    • We are unlikely to know the real patterns until 2007/2008!
    • There are still uncertainties in event sizes and reconstruction time
    • How best to integrate ongoing “infrastructure” improvements from research efforts into our operating model?
  • Lesson from the previous round of experiments at CERN (LEP, 1989-2000):
    • Reviews in 1988 underestimated the computing requirements by an order of magnitude!

  4. ATLAS Computing Model Overview
  • ATLAS has a hierarchical model (EF-T0-T1-T2) with specific roles and responsibilities
  • Data will be processed in stages: RAW → ESD → AOD → TAG
  • Data “production” is well defined and scheduled
  • Roles and responsibilities are assigned within the hierarchy
  • Users will send jobs to the data and extract relevant data
    • typically NTuples or similar
  • The goal is a production and analysis system with seamless access to all ATLAS grid resources
  • All resources need to be managed effectively to ensure ATLAS goals are met and resource providers’ policies are enforced; grid middleware must provide this
  • NOTE: Tier-3 centers are outside the official infrastructure!

  5. ATLAS Facilities and Roles
  • Event Filter Farm at CERN
    • Assembles data (at CERN) into a stream to the Tier 0 Center
  • Tier 0 Center at CERN
    • Data archiving: raw data to mass storage at CERN and to Tier 1 centers
    • Production: fast production of Event Summary Data (ESD) and Analysis Object Data (AOD)
    • Distribution: ESD and AOD to Tier 1 centers and to mass storage at CERN
  • Tier 1 Centers distributed worldwide (10 centers)
    • Data stewardship: re-reconstruction of the raw data they archive, producing new ESD and AOD
    • Coordinated access to full ESD and AOD (all AOD, 20-100% of ESD depending upon the site)
  • Tier 2 Centers distributed worldwide (approximately 30 centers)
    • Monte Carlo simulation, producing ESD and AOD; the ESD and AOD are sent to Tier 1 centers
    • On-demand user physics analysis of shared datasets
  • Tier 3 Centers distributed worldwide
    • Physics analysis
  • A CERN Analysis Facility
    • Analysis
    • Enhanced access to ESD and RAW/calibration data on demand

  6. Computing Model: Event Data Flow from the Event Filter
  • Events are written in “ByteStream” format by the Event Filter farm in 2 GB files
    • ~1000 events/file (nominal size is 1.6 MB/event)
    • 200 Hz trigger rate (independent of luminosity)
  • Currently 4+ streams are foreseen:
    • Express stream with the “most interesting” events
    • Calibration events (including some physics streams, such as inclusive leptons)
    • “Trouble maker” events (for debugging)
    • Full (undivided) event stream
  • One 2 GB file every ~5 seconds will be available from the Event Filter
    • Data will be transferred to the Tier-0 input buffer at 320 MB/s (average)
  • The Tier-0 input buffer will have to hold raw data waiting for processing
    • and also cope with possible backlogs
    • ~125 TB will be sufficient to hold 5 days of raw data on disk
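
The numbers on this slide hang together; a minimal sanity check in Python follows. The only inputs are the nominal event size, trigger rate and file size quoted above, and the slide's rounder figures (~1000 events/file, one file per ~5 s, ~125 TB) come out of the same arithmetic:

```python
# Back-of-the-envelope check of the Event Filter figures quoted on this slide.
EVENT_SIZE_MB = 1.6        # nominal RAW event size
TRIGGER_RATE_HZ = 200      # EF output rate, independent of luminosity
FILE_SIZE_MB = 2000.0      # 2 GB ByteStream files

events_per_file = FILE_SIZE_MB / EVENT_SIZE_MB            # ~1250 events (slide rounds to ~1000)
seconds_per_file = events_per_file / TRIGGER_RATE_HZ      # ~6.3 s (slide rounds to ~5 s)
throughput_mb_s = EVENT_SIZE_MB * TRIGGER_RATE_HZ         # 320 MB/s average into Tier-0

buffer_days = 5
buffer_tb = throughput_mb_s * 86400 * buffer_days / 1e6   # ~138 TB, of order the quoted ~125 TB

print(f"{events_per_file:.0f} events/file, one file every {seconds_per_file:.1f} s")
print(f"average rate {throughput_mb_s:.0f} MB/s; a {buffer_days}-day buffer is ~{buffer_tb:.0f} TB")
```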

  7. ATLAS Data Processing
  • Tier-0:
    • Prompt first-pass processing on express/calibration and physics streams
    • Within 24-48 hours, process the full physics streams with reasonable calibrations
    • Implies large data movement from T0 → T1s and some T0 ↔ T2 (calibration)
  • Tier-1:
    • Reprocess 1-2 months after arrival with better calibrations
    • Reprocess all local RAW at year end with improved calibration and software
    • Implies large data movement T1 ↔ T1 and T1 → T2

  8. ATLAS Tiered Hierarchy
  • [Diagram: Online System → Offline Farm / CERN Computer Centre (~25 TIPS, Tier 0+1) → Tier 1 centers with mass storage (e.g. France, Italy, UK, BNL) → Tier 2 centers (NE Tier2, AGL Tier2, ...) → Tier 3 institutes with physics data caches → Tier 4 workstations]
  • CERN/outside resource ratio ~1:4; Tier0 : (sum of Tier1s) : (sum of Tier2s) ~ 1:2:2
  • Indicative data rates: ~PByte/sec off the detector, ~200-1500 MBytes/sec into Tier 0+1, 10-40 Gbits/sec to Tier 1, ~1-10+ Gbps to Tier 2, 100-10000 Mbits/sec to institutes (~0.25 TIPS each)
  • Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels
  • ATLAS version of Harvey Newman’s original diagram

  9. ATLAS Data to Data Analysis
  • Raw data
    • hits, pulse heights
  • Reconstructed data (ESD)
    • tracks, clusters, ...
  • Analysis Objects (AOD)
    • physics objects
    • summarized
    • organized by physics topic
  • Ntuples, histograms, statistical data
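
To give a feel for the reduction each step buys, here is an illustrative comparison. The RAW size is the nominal 1.6 MB/event from slide 6; the ESD, AOD and TAG per-event sizes are rough order-of-magnitude assumptions for illustration, not numbers taken from this presentation:

```python
# Illustrative data-reduction chain: per-event size and content at each tier.
# Only the RAW figure comes from the slides; the others are assumed magnitudes.
TIERS = [
    ("RAW", 1600.0, "hits, pulse heights"),
    ("ESD",  500.0, "tracks, clusters"),
    ("AOD",  100.0, "physics objects, organized by topic"),
    ("TAG",    1.0, "event-level summary used for selection"),
]

events = 1_000_000   # an example sample size
for name, kb_per_event, content in TIERS:
    size_gb = events * kb_per_event / 1e6
    print(f"{name}: {content:<40} ~{size_gb:8.1f} GB per {events:,} events")
```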

  10. ATLAS Physics Analysis
  • [Diagram: at Tier 0/1, collaboration-wide event selection runs over raw data, ESD, calibration data and event tags to produce analysis objects; at Tier 2, analysis groups run analysis processing to produce physics objects and statistical objects; at Tier 3/4, physicists perform the final physics analysis]

  11. ATLAS Resource Requirements for 2008
  • [Table of required resources from the Computing TDR; not reproduced here]
  • The July 2006 updates reduced the expected contributions

  12. ATLAS Analysis Computing Model
  The ATLAS analysis model is broken into two components:
  • Scheduled central production of augmented AOD, tuples and TAG collections from ESD
    • Derived files moved to other T1s and to T2s
  • Chaotic user analysis of augmented AOD streams, tuples, new selections etc., plus individual user simulation and CPU-bound tasks matching the official MC production
    • Modest to large(?) job traffic between T2s (and T1s, T3s)

  13. Distributed Analysis
  • At this point the emphasis is on a batch model to implement the ATLAS Computing Model
    • Interactive solutions are difficult to realize on top of the current middleware layer
  • We expect ATLAS users to send large batches of short jobs to optimize their turnaround (see the sketch after this slide); the key issues are:
    • Scalability
    • Data access
    • Analysis in parallel to production
    • Job priorities
  • Distributed analysis effectiveness depends strongly upon the hardware and software infrastructure
  • Analysis is divided into “group” and “on demand” types
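
The “large batches of short jobs” pattern can be sketched as follows. This is a purely illustrative helper: split_into_jobs, submit and the file names are invented for the example and do not correspond to any real ATLAS middleware or dataset:

```python
# Sketch of the "many short jobs" pattern: split a dataset's file list into
# small chunks, one batch job per chunk, so failed jobs are cheap to resubmit
# and turnaround stays short. Names here are hypothetical, not ATLAS tools.

def split_into_jobs(files, files_per_job=5):
    """Group input files into per-job chunks."""
    return [files[i:i + files_per_job] for i in range(0, len(files), files_per_job)]

def submit(job_files, executable="analysis.py"):
    """Stand-in for a grid or local batch submission call."""
    print(f"submit {executable} on {len(job_files)} files, starting with {job_files[0]}")

dataset = [f"AOD._{i:05d}.pool.root" for i in range(1, 101)]   # 100 example input files
for chunk in split_into_jobs(dataset, files_per_job=5):         # -> 20 short jobs
    submit(chunk)
```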

  14. ATLAS Group Analysis (Tier-2s)
  • Group analysis is characterised by access to full ESD and perhaps RAW data
    • This is resource intensive
    • It must be a scheduled activity
    • Can back-navigate from AOD to ESD at the same site
    • Can harvest small samples of ESD (and some RAW) to be sent to Tier 2s
    • Must be agreed by the physics and detector groups
  • Group analysis will produce:
    • Deep copies of subsets
    • Dataset definitions
    • TAG selections
  • Big Trains (see the sketch after this slide)
    • Access is most efficient if analyses are blocked into a “big train”
    • The idea has been around for a while and is already used in e.g. heavy ions
    • Each wagon (group) has a wagon master (production manager)
      • Must ensure it will not derail the train
    • The train must run often enough (every ~2 weeks?)
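
A minimal sketch of the big-train idea, assuming a toy event model: several group selections (“wagons”) share one scheduled pass over the ESD, so the expensive full-ESD read happens only once. The wagon names and event fields below are invented for illustration and are not ATLAS software:

```python
# Toy "big train": every wagon (group selection) rides one pass over the ESD.
def lepton_wagon(event):   return event.get("n_leptons", 0) >= 2
def jet_wagon(event):      return event.get("n_jets", 0) >= 4

wagons = {"inclusive_dilepton": lepton_wagon, "multijet": jet_wagon}
outputs = {name: [] for name in wagons}

esd_events = [{"n_leptons": 2, "n_jets": 1},    # stand-in ESD events
              {"n_leptons": 0, "n_jets": 5}]

for event in esd_events:                  # single pass over the expensive input
    for name, selects in wagons.items():  # every wagon sees every event
        if selects(event):
            outputs[name].append(event)   # the group's "deep copy" subset

print({name: len(evts) for name, evts in outputs.items()})
```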

  15. ATLAS On-demand Analysis (Tier-1/3)
  • Restricted Tier 2s and CAF
    • Could specialize some Tier 2s for some groups
    • ALL Tier 2s are for ATLAS-wide usage
  • Role- and group-based quotas are essential
    • Quotas are to be determined per group, not per user
  • Data selection
    • Over small samples with Tier-2 file-based TAG and the AMI dataset selector
    • TAG queries over larger samples by batch job to the database TAG at Tier-1s/large Tier-2s
  • What data?
    • Group-derived EventViews
    • ROOT trees
    • Subsets of ESD and RAW
    • Pre-selected or selected via a Big Train run by a working group
  • Each user needs 14.5 kSI2k (about 12 current “cores” or one new dual quad-core box)
  • ~2.1 TB is ‘associated’ with each user on average (see the sizing example after this slide)
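
Taking the per-user figures on this slide at face value (14.5 kSI2k and ~2.1 TB per user), a site can estimate how many analysis users it supports. The cluster numbers in the example are assumptions for illustration, not ATLAS requirements:

```python
# Rough Tier-3 sizing from the per-user figures quoted on this slide.
PER_USER_CPU_KSI2K = 14.5   # CPU per analysis user
PER_USER_DISK_TB = 2.1      # storage per analysis user

def tier3_capacity(total_cpu_ksi2k, total_disk_tb):
    """Return how many concurrent analysis users a site can support."""
    by_cpu = total_cpu_ksi2k / PER_USER_CPU_KSI2K
    by_disk = total_disk_tb / PER_USER_DISK_TB
    return int(min(by_cpu, by_disk))

# Example: a small departmental cluster (assumed numbers, not from the slides).
print(tier3_capacity(total_cpu_ksi2k=150, total_disk_tb=25))   # -> 10 users, CPU-limited
```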

  16. ATLAS Distributed Data Management
  • Accessing distributed data on the Grid is not a simple task
  • Several databases are needed centrally to hold dataset information
  • “Local” catalogues hold information on local data storage
  • The new DDM system is under test this summer
    • It is being used for all ATLAS data from October 2006 on
  • How do Tier-3s get access? (A simplified illustration of the catalogue idea follows this slide.)
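
A purely hypothetical illustration of the two-level catalogue idea above: a central dataset catalogue maps a dataset name to the sites holding replicas, and each site’s local catalogue maps files to physical paths. The dataset names, site names and lookup function are invented for this sketch and do not model the real ATLAS DDM (DQ2) interfaces:

```python
# Hypothetical central catalogue: dataset -> sites holding a replica.
central_catalogue = {
    "mc.simul.AOD.v1": ["BNL", "LYON", "AGLT2"],
}
# Hypothetical per-site local catalogues: file -> physical path at that site.
local_catalogues = {
    "AGLT2": {"AOD._00001.pool.root": "/atlas/data/mc/AOD._00001.pool.root"},
}

def locate(dataset, preferred_site):
    """Pick a replica site (preferring the local one) and return its file map."""
    sites = central_catalogue.get(dataset, [])
    site = preferred_site if preferred_site in sites else (sites[0] if sites else None)
    return site, local_catalogues.get(site, {})

site, files = locate("mc.simul.AOD.v1", "AGLT2")
print(site, list(files.values()))
```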

  17. Typical ATLAS Tier-3 Needs
  • Network
    • The general comment is “more is better”: a site’s responsiveness is directly proportional to its end-to-end bandwidth.
    • End-to-end capability is what matters: a campus with a 10 GE backbone link but a FastEthernet Tier-3 connection obviously won’t be as capable as a campus with 1 gigabit from the cluster/storage machines out to the backbone.
    • Tier-3 centers with small computing resources and small storage are probably OK with 1 gigabit connectivity for the near term.
    • Bandwidth contention management/prioritization may be very useful/important.
    • Monitoring is vital to understand problems and plan for usage.
  • Compute and storage resources
    • Typical Tier-3 jobs are user “analysis”. A typical ATLAS user could be well served by a single 1U dual quad-core workstation (~15 kSI2K) with fast access to storage.
    • A Tier-3 can tune up and validate major analysis tasks prior to submission to large compute farms for processing.
    • Tier-3 analysis codes typically operate on AOD/ntuple-level inputs.
    • Users will run event selection on Tier-2s and move ntuples to the Tier-3 for custom analysis.
    • Assume users will need about 2 TB of space for AOD/TAGs/ntuples (a transfer-time estimate follows this slide).
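
A rough estimate of what moving ~2 TB of a user’s AOD/TAG/ntuples means at different link speeds, which is why end-to-end bandwidth matters. The 50% efficiency factor is an assumption for the sketch, not a measurement:

```python
# Time to transfer one user's ~2 TB of inputs at a few nominal link speeds,
# assuming only a fraction of the nominal bandwidth is achievable end to end.
def transfer_hours(data_tb, link_gbps, efficiency=0.5):
    data_bits = data_tb * 1e12 * 8
    return data_bits / (link_gbps * 1e9 * efficiency) / 3600

for gbps in (0.1, 1.0, 10.0):      # FastEthernet, gigabit, 10 GE
    print(f"{gbps:>5} Gb/s: {transfer_hours(2.0, gbps):6.1f} hours")
```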

  18. Conclusions
  • ATLAS is quickly approaching “real” data, and our computing model has been successfully validated (as far as we have been able to take it).
  • Some major uncertainties remain, especially around “user analysis” and the resource implications it may have.
  • Tier-3s should understand the overall model and how they can best fit in. Typical requirements are modest network connectivity, significant per-user storage, and modest computational power.
  • The challenge will be to deploy and test the ATLAS system down to the Tier-3 and user-interface level in time to develop a responsive, effective system to support the users.
  Questions?
