Catalin Cirstoiu 28/11/2007 WLCG Critical Service Reliability Workshop CERN. Experiment Critical Services and Monitoring What's Missing for CCRC'08 (and beyond)? ALICE viewpoint. Contents. What we have Data collection and storage Visualization methods Tools What we’re missing.
Catalin Cirstoiu, 28/11/2007
WLCG Critical Service Reliability Workshop, CERN
Experiment Critical Services and Monitoring: What's Missing for CCRC'08 (and beyond)? ALICE viewpoint
http://pcalimonitor.cern.ch/
Contents
• What we have
• Data collection and storage
• Visualization methods
• Tools
• What we’re missing
Data collection and storage
[Diagram: AliEn services (CE, Cluster Monitor, IS, Optimizers, Brokers, TQ, SE, Job Agents, MySQL servers, CastorGrid scripts, API Services), each instrumented with ApMon, report to MonALISA services (MonALISA @Site, MonALISA LCG Site, MonALISA @CERN). Monitored parameters: job slots, run time, net in/out, cpu time, free space, processes, load, jobs status, vsz, rss, sockets, migrated mbytes, active sessions, nr. of files, open files, queued JobAgents, job status, cpu ksi2k, disk used, MyProxy status. Aggregated data goes to the MonALISA Repository (long history DB, alerts, actions). SAM and LCG Tools also appear in the diagram.]
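The step from per-node parameters to "Aggregated Data" can be pictured with a small sketch. This is not MonALISA or ApMon code: the sample tuples, the parameter names and the `aggregate_by_site` helper are all hypothetical, and summation is only one possible aggregation.

```python
from collections import defaultdict

# Hypothetical per-node samples: (site, node, parameter, value).
# Real values would arrive from ApMon-instrumented services.
SAMPLES = [
    ("CERN", "node01", "jobs_running", 12),
    ("CERN", "node02", "jobs_running", 8),
    ("CERN", "node01", "disk_used_gb", 150.0),
    ("SARA", "node01", "jobs_running", 5),
]


def aggregate_by_site(samples):
    """Sum every parameter over all nodes of a site."""
    totals = defaultdict(lambda: defaultdict(float))
    for site, _node, param, value in samples:
        totals[site][param] += value
    # Return plain dicts so the result is easy to print or serialise.
    return {site: dict(params) for site, params in totals.items()}


if __name__ == "__main__":
    for site, params in aggregate_by_site(SAMPLES).items():
        print(site, params)
```

Summing suits counters such as running jobs or disk used; a parameter such as load would more plausibly be averaged, so a fuller version would pick the aggregation function per parameter.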
General site status map
Storage status
ALICE SAM specific site tests
Tools
• Anybody can subscribe to e-mail or RSS notifications about problems with the various components of the system (central and site services, storage, proxies, general announcements and so on): http://pcalimonitor.cern.ch/xml.jsp
• A Firefox toolbar helps to quickly spot current issues
• A certificate-based administrative interface helps the Grid managers with day-to-day operations (site services management, production jobs, software packages, tracking of pledged resources, etc.)
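As an illustration of how a subscriber might consume such a feed, here is a minimal sketch that extracts the items from an RSS document. It assumes a standard RSS 2.0 layout (`item`, `title`, `link`, `pubDate`); the actual structure served at the URL above is not shown in the slides, and the sample feed and the `parse_items` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, assuming a plain RSS 2.0 structure.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example alerts</title>
    <item>
      <title>Site service down (hypothetical)</title>
      <link>http://example.org/alerts/1</link>
      <pubDate>Wed, 28 Nov 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Proxy about to expire (hypothetical)</title>
      <link>http://example.org/alerts/2</link>
      <pubDate>Wed, 28 Nov 2007 11:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item> element found in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["published"], "|", entry["title"])
```

The sketch parses a string rather than fetching over the network, so it can be tried offline; a real client would download the feed first and pass the text to `parse_items`.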
We monitor FTS transfers…
• From AliEn’s FTD
  • Speed, size, SE in/out, status
• From ARDA FTS Dashboard
  • Site efficiency, error tracking
… But we’re missing
• FTS service status monitoring at T0
  • We know from the ALICE dashboard that a transfer has failed, but we don’t know whether the failure is ALICE-specific…
  • There was a general (all VOs) status page. Has it moved? It was at http://pcitgm02.cern.ch:8081/ …
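The missing piece is a cross-VO view: with transfer outcomes for all VOs on a channel, a failure could be classified as ALICE-specific or service-wide. A minimal sketch of that reasoning follows, assuming such all-VO data were available. The input format, the 0.5 threshold and the `classify_failures` helper are hypothetical and not part of any FTS or dashboard API.

```python
from collections import Counter


def classify_failures(transfers, vo="alice", threshold=0.5):
    """Classify the failures of one VO on one channel.

    transfers: iterable of (vo_name, succeeded) pairs covering all VOs.
    Returns "no-data", "ok", "unknown", "service-wide" or "vo-specific".
    """
    total = Counter()
    failed = Counter()
    for name, succeeded in transfers:
        total[name] += 1
        if not succeeded:
            failed[name] += 1

    if total[vo] == 0:
        return "no-data"
    if failed[vo] / total[vo] < threshold:
        return "ok"

    # Our VO is failing: compare against everybody else on the channel.
    others_total = sum(n for name, n in total.items() if name != vo)
    others_failed = sum(n for name, n in failed.items() if name != vo)
    if others_total == 0:
        # This is the situation described above: only our own view exists.
        return "unknown"
    if others_failed / others_total >= threshold:
        return "service-wide"
    return "vo-specific"


if __name__ == "__main__":
    sample = [("alice", False), ("alice", False), ("atlas", True), ("cms", True)]
    print(classify_failures(sample))
```

With only the ALICE dashboard as input the function can return nothing better than "unknown", which is exactly the gap a general status page would close.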
E-mail notifications
• From MonALISA/SAM
  • Service malfunctions, sent to (or subscribed to by) site experts, GGUS tickets and mailing lists
• From EGEE Broadcast (the CIC portal)
  • Announcements of downtimes and of scheduled and unscheduled interventions
CCRC’08 general logbook
• General flow of service messages, for example:
  • CERN FTS service restart - downtime 15”
  • ALICE transfers to ‘SARA’ fail consistently with error message ‘The file has size _size_ and should have _size_’
• This logbook should be accessible to, and populated by, those responsible for the experiments and services, and should serve as a central information system for the exercise
• In the previous exercise this information was kept in experiment-specific Wiki pages, but a Wiki is not the right tool for this
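To make the idea of a shared logbook concrete, here is a small sketch of what one entry and a query over the entries might look like. The slides do not define a schema, so the field names, the `scope` convention and both classes are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class LogbookEntry:
    timestamp: datetime
    author: str    # person or team posting the message (hypothetical field)
    scope: str     # "service" or an experiment name such as "alice"
    message: str


@dataclass
class Logbook:
    entries: list = field(default_factory=list)

    def add(self, entry):
        """Append an entry; both experiments and services write here."""
        self.entries.append(entry)

    def by_scope(self, scope):
        """Return the entries for one scope, oldest first."""
        selected = (e for e in self.entries if e.scope == scope)
        return sorted(selected, key=lambda e: e.timestamp)


if __name__ == "__main__":
    book = Logbook()
    # Timestamps and authors are made up; the messages echo the examples above.
    book.add(LogbookEntry(datetime(2008, 2, 5, 10, 0), "fts-admin", "service",
                          "CERN FTS service restart"))
    book.add(LogbookEntry(datetime(2008, 2, 5, 9, 0), "alice-ops", "alice",
                          "ALICE transfers to 'SARA' fail consistently"))
    for entry in book.by_scope("alice"):
        print(entry.timestamp.isoformat(), entry.message)
```

Keeping service and experiment messages in one store with a common timestamp is what lets a reader line up, say, a service restart with a burst of failed transfers, which separate per-experiment Wiki pages make hard.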
Thank you! Questions?
http://alien.cern.ch
http://monalisa.caltech.edu