Experiment Operations

Experiment Operations Simone Campana

Outline • Try to answer the following questions: • How are experiment operations organized? • Which communication channels are used? • What are the commonalities? • What are the differences? Thanks to Patricia Mendez Lorenzo, Roberto Santinelli and Andrea Sciaba + many others from the experiments

CMS Computing Operations • Computing Shift Person (CSP) at the CMS centre at CERN or FNAL • Monitors the computing infrastructure and services by going through a checklist • Identifies problems, triggers actions and calls • Creates eLog reports and support tickets • Reacts to unexpected events • Computing Run Coordinator (CRC) at CERN • Keeps an overview of offline computing plans and status, acts as operational link with online, keeps track of open computing issues • Is a computing expert • Expert On Call (EOC), physically located anywhere in the world • Expert in one or more aspects of the computing system (there can be more than one EOC) • Must be on call

CMS Computing Operations • Data Operations expert on call: • Runs the T0 workflows and the T1 transfers • Monitors the above workflows • Time Coverage • During global runs: • Computing Shift Person: 8-hour shifts, 16/7 coverage • DataOps expert: 16/7 mandatory, 24/7 voluntary • Otherwise (local runs): • CSP: 8/5 coverage • DataOps expert: just on call

LHCb Computing Operations • Grid Shifters (a.k.a. production shifters) • Run production and data handling activities • Identify and escalate problems • Need some not-so-basic knowledge of Grid services and the LHCb framework • See the tick list for more information: https://twiki.cern.ch/twiki/pub/LHCb/ProductionOperations/GridShifter170808.pdf • Grid Expert on call • Addresses problems • Defines/improves operational procedures • Production Manager (based at CERN) • Organizes the overall production • DIRAC developers/experts • Dedicate a fraction of their time to running Grid Operations • All Grid Operations are run from CERN • With the exception of some contact persons at T1s whose role also fits in one of the above

LHCb Time Coverage • LHC down: decided to move to 1 shifter during working hours • For more information, please check the production operations web page: https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperations

ALICE Computing Operations • ALICE Computing Operations is a joint effort between: • ALICE Core Offline team running ALICE operations • Centralized at CERN • WLCG ALICE experiment support, i.e. people offering Grid expertise to ALICE • Production manager organizing the overall activity • with workflow and component experts behind • data expert, workload expert, AliEn expert, etc. • Offline shifts in the ALICE control room (P2) • Support the central Grid services and management tasks • RAW data registration (T0) and replication to T1s • Conditions data gathering, storage and replication • Quasi-online first-pass reconstruction at T0 • and asynchronous second pass at T1s • ALICE Central Services status • ALICE Site Services (VO-box/WMS/storage) status

ALICE Time Coverage • Offline shifts 24/7 during data taking • First-line support at CERN provided by IT/GS • Site support is tiered and assured by regional experts • one per country/region, in contact with site experts • supported by the Core Offline team and/or by the WLCG experts for high-level or complex Grid issues • Support at T2 sites is also very important

ATLAS Computing Operations • ATLAS Computing Shift at P1: 24(16)/7 during data taking • T0 shifter • Monitors data collection and recording from P1 to T0 • Monitors first processing at T0 • Distributed Computing Shifter • Monitors T0-T1 and T1-T1 data distribution • Database shifter • ATLAS Distributed Computing Shifts (ADCoS) • Several levels of expertise: Trainee, Senior, Expert, Coordinator • Monitor Monte Carlo production and T2 transfer activities • ATLAS Expert On-Call: 24/7 • Offers expertise for data distribution activities • Developers and single-component experts: best effort • offering third-level support

ADCoS Time Coverage • Europe: 5 experts + 10 seniors + 5 trainees • Asia: 4 seniors + 1 trainee • America: 2 experts + 5 seniors + 3 trainees • Covering 24 h/day and 6 days/week, with people in three time zones (no need for night shifts)
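The ADCoS staffing figures above can be tallied with a few lines of Python. This is only a minimal sketch: the names `ADCOS_STAFF`, `totals_by_region` and `totals_by_level` are made up for illustration, and the slide lists no experts for Asia, so that entry is assumed to be 0.

```python
# Headcounts as listed on the slide: region -> level -> number of shifters.
# Assumption: Asia lists no experts, so that entry is taken as 0.
ADCOS_STAFF = {
    "Europe": {"expert": 5, "senior": 10, "trainee": 5},
    "Asia": {"expert": 0, "senior": 4, "trainee": 1},
    "America": {"expert": 2, "senior": 5, "trainee": 3},
}


def totals_by_region(staff):
    """Sum the headcount of every level within each region."""
    return {region: sum(levels.values()) for region, levels in staff.items()}


def totals_by_level(staff):
    """Sum the headcount of every region for each level."""
    totals = {}
    for levels in staff.values():
        for level, count in levels.items():
            totals[level] = totals.get(level, 0) + count
    return totals


# 24 hours/day on 6 days/week, as stated on the slide.
WEEKLY_COVERAGE_HOURS = 24 * 6

if __name__ == "__main__":
    print(totals_by_region(ADCOS_STAFF))
    print(totals_by_level(ADCOS_STAFF))
    print(WEEKLY_COVERAGE_HOURS)
```

Under these assumptions the tally gives 35 shifters in total (7 experts, 19 seniors, 9 trainees) covering 144 hours per week.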

CMS Communication Channels • eLog (using DAQ eLog + FNAL eLog, will have dedicated CERN box) • “Computing plan of the day” (by the CRC) • AIM accounts for shifters • Savannah • + GGUS for EGEE sites • Sites → Operations: Savannah + HN • Operations → Sites: Savannah, GGUS (+HN) • Users → Operations: CMS user support (Savannah + email)

LHCb Communication Channels • Internal to LHCb: • Elog book: http://lblogbook.cern.ch/Operations/ • 14x7: expert cell-phone number 16-1914 • Daily meeting (14:30 – 15:??) • Mailing lists: lhcb-grid@cern.ch (for ops matters), lhcb-dirac@cern.ch (for dev matters), plus a mailing list for each contact person • Reaching out to services and sites: • GGUS and/or Remedy • ALARM tickets used just for tests, TEAM tickets not extensively used yet • WLCG daily and weekly meetings • IT/LHCb coordination meeting, SCM meeting • Higher-level meetings (GDB/MB) • Local contact person and central grid coordinator useful for speeding up resolution of problems • Being reached by users and sites: • Support unit defined in GGUS • Mailing lists • Contact persons acting as liaison/reference for many site admins and service providers

ALICE Communication Channels • Internal ALICE communication • Mailing list • ALICE-LCG-EGEE Task Force • Communication with users and User Support • Mailing list for operational problems and Savannah tracker for bugs • Monthly User Forums (EVO) for dissemination of new Grid-related information and analysis news • Monthly Grid training for new users • Communication with sites and Grid operation support • Task Force mailing list for operational problems • GGUS • Daily WLCG ops meetings • Weekly ALICE-LCG Task Force meetings • Dedicated contacts with many sites

ATLAS Communication Channels • Internal Communication • ADCoS ELOG + T0 ELOG + ADCS@P1 ELOG • Savannah for DDM problem tracking • Communication with sites • Mainly GGUS • Team Tickets for all shifts + ALARM tickets for a restricted list of experts • Support Mailing Lists • mostly for CERN (CASTOR, FTS, LFC) • Cloud Mailing Lists • Informational only • Many sites read ELOG • No clear site2ATLAS channel • ATLAS operations mailing list, but something better should be devised • Communication with Users • Mostly HN for Operations2Users • GGUS + Savannah for Users2Operations • … and meetings: daily WLCG meeting, weekly ATLAS ops

Conclusions (I) • Experiment Operations rely on a multilevel operation model • First line: shift crew • Second line: Experts On-Call • Developers as third-line support • not necessarily on-call • Experiment Operations are strongly integrated with WLCG operations and Grid Service Support • Expert support • Escalation procedures • Especially for critical or long-standing issues • Incident post-mortems • Communications and Notifications • I personally like the daily 15:00h meeting
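The three support lines named in these conclusions can be pictured as an ordered escalation chain. The following is a minimal sketch only: `SupportLine`, `SUPPORT_LINES` and `next_line` are hypothetical names chosen for illustration, not part of any experiment's tooling, and real escalation procedures differ per experiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SupportLine:
    """One level of the multilevel operation model (hypothetical type)."""
    name: str
    availability: str


# Ordered from first line to third line, as listed in the conclusions.
SUPPORT_LINES = (
    SupportLine("shift crew", "on shift"),
    SupportLine("experts on-call", "on call"),
    SupportLine("developers", "not necessarily on call"),
)


def next_line(current: SupportLine) -> Optional[SupportLine]:
    """Return the line a problem escalates to, or None past the last line."""
    index = SUPPORT_LINES.index(current)
    if index + 1 < len(SUPPORT_LINES):
        return SUPPORT_LINES[index + 1]
    return None


if __name__ == "__main__":
    line: Optional[SupportLine] = SUPPORT_LINES[0]
    while line is not None:
        print(f"{line.name} ({line.availability})")
        line = next_line(line)
```

Modelling the chain as an ordered tuple keeps the escalation order explicit and makes the end of the chain (no further line after the developers) easy to detect.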

Conclusions (II) • ATLAS and CMS rely on a more distributed operation model • Worldwide shifts and experts on call • Central coordination always at CERN • Possibly due to the geographical distribution of partner sites • Especially for the US and Asia regions • All experiments recognize the importance of experiment-dedicated support at sites • CMS can rely on contacts at every T1 and T2 • ATLAS and ALICE can rely on contacts per region/cloud • Contacts at all T1s, usually dedicated • Some dedicated contacts also at some T2s • LHCb can rely on contacts at some T1s
