Explore how experiment operations are organized and the communication channels used. Identify commonalities and differences among CMS, LHCb, ALICE, and ATLAS computing operations.
Experiment Operations Simone Campana
Outline • Try to answer the following questions: • How are experiment operations organized? • Which communication channels are used? • What are the commonalities? • What are the differences? Thanks to Patricia Mendez Lorenzo, Roberto Santinelli and Andrea Sciaba + many others from the experiments
CMS Computing Operations • Computing Shift Person (CSP) at the CMS centre at CERN or FNAL • Monitors the computing infrastructure and services by going through a checklist • Identifies problems, triggers actions and calls • Creates eLog reports and support tickets • Reacts to unexpected events • Computing Run Coordinator (CRC) at CERN • Keeps an overview of offline computing plans and status, acts as the operational link with online, keeps track of open computing issues • Is a computing expert • Expert On Call (EOC), physically located anywhere in the world • Has deep expertise in one or more aspects of the computing system (there can be more than one) • Must be on call
CMS Computing Operations • Data Operations (DataOps) expert on call: • Runs the T0 workflows and the T1 transfers • Monitors the above workflows • Time Coverage • During global runs: • Computing Shift Person (CSP): 8-hour shifts, 16/7 coverage • DataOps expert: 16/7 mandatory, 24/7 voluntary • Otherwise (local runs): • CSP: 8/5 coverage • DataOps expert: just on call
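The coverage figures above use an hours-per-day/days-per-week shorthand. As a minimal sketch, assuming "16/7" means 16 hours a day, 7 days a week, and that coverage is filled with back-to-back shifts of fixed length, the number of shift slots per week can be worked out as follows. The function name `shifts_per_week` and the no-overlap assumption are illustrative and do not come from the slides.

```python
def shifts_per_week(coverage: str, shift_hours: int = 8) -> int:
    """Return the number of shift slots per week for an 'H/D' coverage string.

    Assumption (not from the slides): 'H/D' means H hours per day on
    D days per week, filled with non-overlapping shifts of shift_hours.
    """
    hours_per_day, days_per_week = (int(part) for part in coverage.split("/"))
    # Ceiling division: a partially covered block still needs a full shift slot.
    shifts_per_day = -(-hours_per_day // shift_hours)
    return shifts_per_day * days_per_week


if __name__ == "__main__":
    for label in ("16/7", "24/7", "8/5"):
        print(label, "->", shifts_per_week(label), "shift slots per week")
```

Under these assumptions, 16/7 coverage with 8-hour shifts corresponds to two shift slots per day.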
LHCb Computing Operations • Grid Shifters (a.k.a. production shifters) • Run production and data handling activities • Identify and escalate problems • Need some not-so-basic knowledge of Grid services and the LHCb framework • See tick list for more information: https://twiki.cern.ch/twiki/pub/LHCb/ProductionOperations/GridShifter170808.pdf • Grid Expert on call • Addresses problems • Defines and improves operational procedures • Production Manager (based at CERN) • Organizes the overall production • DIRAC developers (experts) • Dedicate a fraction of their time to running Grid Operations • All Grid Operations are run from CERN • With the exception of some contact persons at T1s whose role also fits in one of the above
LHCb Time Coverage • While the LHC is down: decided to move to 1 shifter during working hours • For more information, see the production operations web page: https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperations
ALICE Computing Operations • ALICE Computing Operations is a joint effort between: • ALICE Core Offline team running ALICE operations • Centralized at CERN • WLCG ALICE experiment support, i.e. people offering Grid expertise to ALICE • Production manager organizing the overall activity • with workflow and component experts behind • data expert, workload expert, AliEn expert, etc. • Offline shifts in the ALICE control room (P2) • Support the central Grid services and management tasks • RAW data registration (T0) and replication to T1s • Conditions data gathering, storage and replication • Quasi-online first-pass reconstruction at T0 • and asynchronous second pass at T1s • ALICE Central Services status • ALICE Site Services (VO-box/WMS/storage) status
ALICE Time Coverage • Offline shifts 24/7 during data taking • First-line support at CERN provided by IT/GS • Site support is tiered and provided by regional experts • one per country/region, in contact with site experts • supported by the Core Offline team and/or by the WLCG experts for high-level or complex Grid issues • Support at T2 sites is also very important
ATLAS Computing Operations • ATLAS Computing Shift at P1: 24(16)/7 during data taking • T0 shifter • Monitors data collection and recording from P1 to T0 • Monitors first processing at T0 • Distributed Computing Shifter • Monitors T0-T1 and T1-T1 data distribution • Database shifter • ATLAS Distributed Computing Shifts (ADCoS) • Several levels of expertise: Trainee, Senior, Expert, Coordinator • Monitor Monte Carlo production and T2 transfer activities • ATLAS Expert On-Call: 24/7 • Offers expertise for data distribution activities • Developers and single-component experts: best effort • Offer third-level support
ADCoS Time Coverage • Europe: 5 experts + 10 seniors + 5 trainees • Asia: 4 seniors + 1 trainee • America: 2 experts + 5 seniors + 3 trainees • Covering 24 h/day and 6 days/week, with people in three time zones (no need for night shifts)
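The head counts above can be totalled per region and per expertise level. The sketch below is illustrative only: the dictionary layout and function names are assumptions, and Asia is given 0 experts because the slide lists none for that region.

```python
# Head counts as listed on the slide.
# Assumption: Asia lists no experts, so that entry is set to 0.
ADCOS_STAFF = {
    "Europe": {"experts": 5, "seniors": 10, "trainees": 5},
    "Asia": {"experts": 0, "seniors": 4, "trainees": 1},
    "America": {"experts": 2, "seniors": 5, "trainees": 3},
}


def totals_by_region(staff):
    """Sum all expertise levels within each region."""
    return {region: sum(levels.values()) for region, levels in staff.items()}


def totals_by_level(staff):
    """Sum each expertise level across all regions."""
    totals = {}
    for levels in staff.values():
        for level, count in levels.items():
            totals[level] = totals.get(level, 0) + count
    return totals


if __name__ == "__main__":
    print(totals_by_region(ADCOS_STAFF))
    print(totals_by_level(ADCOS_STAFF))
    print("total shifters:", sum(totals_by_region(ADCOS_STAFF).values()))
```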
CMS Communication Channels • eLog (using DAQ eLog + FNAL eLog, will have dedicated CERN box) • “Computing plan of the day” (by the CRC) • AIM accounts for shifters • Savannah • + GGUS for EGEE sites • Sites → Operations: Savannah + HN • Operations → Sites: Savannah, GGUS (+HN) • Users → Operations: CMS user support (Savannah + email)
LHCb Communication Channels • Internal to LHCb: • Elog book: http://lblogbook.cern.ch/Operations/ • 14x7: expert cell-phone number: 16-1914 • Daily meeting (14:30 – 15:??) • Mailing lists: lhcb-grid@cern.ch (for ops matters), lhcb-dirac@cern.ch (for dev matters), and a mailing list for each contact person • Reaching out to services and sites: • GGUS and/or Remedy • ALARM tickets used only for tests so far; TEAM tickets not extensively used yet • WLCG daily and weekly meetings • IT/LHCb coordination meeting, SCM meeting • Higher-level meetings (GDB/MB) • Local contact person and central grid coordinator are useful for speeding up the resolution of problems • Being reached by users and sites: • Support unit defined in GGUS • Mailing lists • Contact persons acting as liaison/reference for many site admins and service providers
ALICE Communication Channels • Internal ALICE communication • Mailing list • ALICE-LCG-EGEE Task Force • Communication with users and User Support • Mailing list for operational problems and Savannah tracker for bugs • Monthly User Forums (EVO) for dissemination of new Grid-related information and analysis news • And monthly Grid training for new users • Communication with sites and Grid operation support • Task Force mailing list for operational problems • GGUS • Daily WLCG ops meetings • Weekly ALICE-LCG Task Force meetings • Dedicated contacts with many sites
ATLAS Communication Channels • Internal Communication • ADCoS ELOG + T0 ELOG + ADCS@P1 ELOG • Savannah for DDM problem tracking • Communication with sites • Mainly GGUS • Team Tickets for all shifts + ALARM tickets for a restricted list of experts • Support Mailing Lists • mostly for CERN (CASTOR, FTS, LFC) • Cloud Mailing Lists • Informational only • Many sites read ELOG • No clear site2ATLAS channel • There is an ATLAS operations mailing list, but something better should be devised • Communication with Users • Mostly HN for Operations2Users • GGUS + Savannah for Users2Operations • … and meetings: Daily WLCG Meeting, weekly ATLAS ops
Conclusions (I) • Experiment Operations rely on a multilevel operation model • First line: shift crew • Second line: Experts On-Call • Developers as third-line support • not necessarily on-call • Experiment Operations are strongly integrated with WLCG operations and Grid Service Support • Expert support • Escalation procedures • Especially for critical or long-standing issues • Incident post-mortems • Communications and Notifications • I personally like the daily 15:00h meeting
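The three-line support model summarized above can be written down as a small data structure. This is a sketch under stated assumptions, not any experiment's actual procedure: the `SupportLine` class, the availability labels and the `escalate` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SupportLine:
    level: int         # 1 = first line, 2 = second line, 3 = third line
    role: str
    availability: str  # illustrative label, paraphrasing the slide


# The three lines of support named on the slide.
SUPPORT_LINES = [
    SupportLine(1, "shift crew", "on shift"),
    SupportLine(2, "experts on call", "on call"),
    SupportLine(3, "developers", "not necessarily on call"),
]


def escalate(current_level: int) -> Optional[SupportLine]:
    """Return the next support line above current_level, or None at the top."""
    for line in SUPPORT_LINES:
        if line.level == current_level + 1:
            return line
    return None


if __name__ == "__main__":
    line = SUPPORT_LINES[0]
    while line is not None:
        print(f"line {line.level}: {line.role} ({line.availability})")
        line = escalate(line.level)
```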
Conclusions (II) • ATLAS and CMS rely on a more distributed operation model • Worldwide shifts and experts on call • Central coordination always at CERN • Possibly due to the geographical distribution of partner sites • Especially for the US and Asia regions • All experiments recognize the importance of experiment-dedicated support at sites • CMS can rely on contacts at every T1 and T2 • ATLAS and ALICE can rely on contacts per region/cloud • Contacts at all T1s, usually dedicated • Some dedicated contacts also at some T2s • LHCb can rely on contacts at some T1s