New WLCG Grid Service Monitoring Displays
James Casey, CERN IT-GD
HEPIX, November 2007
Overview
• Service Monitoring in WLCG
• Site Service Monitoring
• Nagios
• Central Monitoring
• GridMap
• Future work
Nov 8th 2007/ HEPIX / New WLCG Grid Service Monitoring Displays
WLCG Monitoring Working Groups
• 3 groups created by Ian Bird, Oct’06
  - “… to help improve the reliability of the grid infrastructure …”
  - “… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service. …”
  - “… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
• System Management: Fabric management, Best Practices, Security, …
• Grid Services: Grid sensors, Transport, Metric Repositories, Views, …
• System Analysis: Application monitoring, …
Monitoring
You can’t manage what you don’t measure...
• accuracy and credibility
• appropriate metrics
  - directly relevant to user experience
  - clearly defined and understood
• measurement instrumentation
  - active, passive, collection intervals, alarms
• data collection points
  - system, element, service
• real-time, historical
[Diagram labels: Grid; Monitoring (Sensors/Agents, Transport, Repositories, Views); Presentation; automated decision making; manual decision making; Control]
Slide by Max Böhm, EDS
WLCG Grid Monitoring Landscape
(Domain: Monitoring: Tools in use)
• Grid Applications: Application monitoring: Experiment Dashboards, ...
• Grid Middleware (central services, site services): Grid Services monitoring: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
• Local resources (site): Local monitoring: Lemon/SLS, Nagios, Ganglia, ...
Slide by Max Böhm, EDS
Grid Monitoring Landscape
[Diagram: monitoring tools and data flows across the grid layers. Labels recoverable from the figure:]
• Layers: Experiment/VO (ATLAS, ...); App Layer / Apps (Ganga/Panda, AtlasProdDB, FileCatalog, ResourceBroker, Info System); Central Services (RB, RGMA, BDII, LFC, FTS, LB); Site Services (CE, SE); Fabric Resources (batch, CPUs, TBs)
• Views: Experiment Dashboards, one per experiment (VO jobs, data, site reliability); RTM (real time 3D job view); GStat (site status + graphs); GridView (data transfer, job status, service availability); SAM (submit test jobs, results); GridICE (fabric/job infos); LEMON (fabric infos); other monitoring tools
• Site registry and information sources: GOCDB, BDII, extBDII
• Transports: RGMA, MonALISA, HTTP/XML pull, HTTP/XML push agents, HTTP/SOAP push, LDAP, DB access, html
High-level Model
See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf for details
[Diagram labels: LEMON, Nagios, GOCDB, R-GMA, HTTP, LDAP, SAM, SAME, GridView, GridIce, GridMap, Experiment Dashboard]
Grid Site Monitoring Principles
• Provide an easily extensible site monitoring system
  - Or be able to plug grid features into existing site monitoring
  - Should be able to provide (or augment) alarms at the site for the grid services
• Don’t force a solution on the site administrators
  - Should work with any fabric monitoring system that provides basic functionality
• Provide the specific plugins to deal with the Grid
  - Probes that work for Grid Services
• Enable export of the data from the site into standard grid monitoring systems, e.g. SAM, GridView, GridICE, …
• Avoid duplicate running of probes
Purpose
• Bring in data from existing monitoring systems inside the site monitoring tools
  - Service Availability Monitoring (SAM)
  - Network performance monitoring (NPM)
  - Experiment site blacklists (FCR tool)
  - Experiment dashboards, …
• Decided to create a prototype based on Nagios
  - Due to existing take-up of Nagios in the community
• Second stage will be to integrate with LEMON
  - As next most common solution
  - Based on questionnaire to community
Nagios
• Open source monitoring system
• Widely used & actively developed
• Host and service problem detection and recovery
• Provides a set of basic plugins (sensors)
  - easy to develop custom sensors
• No components required on monitored entities
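The plugin contract that makes custom sensors easy to write is small: a plugin prints one line of status text and reports its state through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch in Python, assuming a plain TCP connect is an adequate liveness check. The function names and the choice of check are illustrative and are not taken from the WLCG probes.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_tcp(host, port, timeout=5.0):
    """Return (state, message) for a plain TCP connect to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"{host}:{port} accepts connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution errors.
        return CRITICAL, f"{host}:{port} unreachable ({exc})"


def format_result(state, message):
    """Build the single status line that Nagios shows for the check."""
    return f"{STATE_NAMES[state]} - {message}"


def main(argv):
    """Run the check; the return value is meant to be used as the exit code."""
    if len(argv) != 3 or not argv[2].isdigit():
        print(format_result(UNKNOWN, "usage: check_tcp.py HOST PORT"))
        return UNKNOWN
    state, message = check_tcp(argv[1], int(argv[2]))
    print(format_result(state, message))
    return state
```

Saved as a script, the file would finish by passing the result of `main(sys.argv)` to `sys.exit`, so that Nagios sees the state as the process exit code.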
Architecture
[Diagram: a monitoring server runs live node checks and service checks against the site nodes (CE, SE, LFC, Site BDII, …). Other labels: Site admins (Issue alarms, Get site status); Get Nagios results; Get remote results; Get VOMS proxy; Refresh proxy; MyProxy; Get site’s & nodes information; Get nodes information; Probe descriptions]
Grid Extensions
• Standard probes
  - provided by SRCE, CERN, OSG
  - Security facilities & services: CA distribution, Certificate lifetime, MyProxy
  - Monitoring & information services: R-GMA, BDII, MDS, GridICE
  - Job management services: Globus Gatekeeper, RB, WMS, WMProxy, Job matching
  - Data management services: GridFTP, SRM, DPNS, LFC, FTS
• Remote gatherers
  - SAM & NPM
• Nagios Config Generator (NCG), Publisher, Credential management
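The slides do not say how the remote gatherers hand SAM or NPM results to Nagios. One mechanism Nagios itself offers for results produced elsewhere is the passive check: a `PROCESS_SERVICE_CHECK_RESULT` line written to the external command file. The sketch below only formats such a line; the host name and service name in the usage note are hypothetical.

```python
import time

# Nagios service states: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
VALID_STATES = {0, 1, 2, 3}


def passive_service_result(host, service, state, output, timestamp=None):
    """Format one passive service check result as a Nagios external command."""
    if state not in VALID_STATES:
        raise ValueError(f"invalid Nagios state: {state!r}")
    if timestamp is None:
        timestamp = time.time()
    # An external command is a single line, so keep only the first line of output.
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{state};{first_line}"
    )
```

For example, `passive_service_result("ce01.example.org", "SAM-CE-job-submit", 2, "job submission failed")` yields a line that a gatherer could append to the command file configured in `nagios.cfg`.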
Standard Components
• Probe wrapper
  - enables integration of standardized probes
  - One probe can run in Nagios, LEMON, SAM, …
  - Grid Monitoring Probes Specification: https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeSpecification
• Publisher & remote gatherers
  - integration with other tools
  - Existing tools can just consume the data, e.g. SAM, GridView, Dashboards, …
  - Grid Monitoring Data Exchange Standard: https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
Comments, contributions & probes welcome!
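The idea that one probe can run under several frameworks can be sketched as a wrapper that runs the probe once and hands the result to per-framework adapters. This is an illustration only: the `ProbeResult` fields, the status names and the record layout are assumptions, and they do not reproduce the Grid Monitoring Probes Specification or the Data Exchange Standard.

```python
from dataclasses import dataclass

# Hypothetical status vocabulary, mapped to Nagios exit codes.
STATUS_TO_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


@dataclass(frozen=True)
class ProbeResult:
    service: str
    host: str
    status: str
    summary: str


def run_probe(probe, service, host):
    """Run probe(host) -> (status, summary) and normalise failures to UNKNOWN."""
    try:
        status, summary = probe(host)
    except Exception as exc:
        status, summary = "UNKNOWN", f"probe raised {type(exc).__name__}: {exc}"
    if status not in STATUS_TO_EXIT:
        status, summary = "UNKNOWN", f"probe returned invalid status {status!r}"
    return ProbeResult(service, host, status, summary)


def to_nagios(result):
    """Adapter for Nagios: (exit code, status line)."""
    return STATUS_TO_EXIT[result.status], f"{result.status} - {result.summary}"


def to_record(result):
    """Adapter for a publisher: a flat key/value record (layout is assumed)."""
    return {
        "service": result.service,
        "host": result.host,
        "status": result.status,
        "summary": result.summary,
    }
```

Keeping the probe free of framework details means a site only has to add an adapter for its own fabric monitoring system, which is in line with the principle of not forcing a solution on site administrators.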
[Screenshot labels: SAM, Standard probes, NPM]
Current Status
• Three sets of standard probes integrated
  - SRCE, CERN, OSG
• RPMs in apt and yum repository
  - http://www.sysadmin.hep.ac.uk
• Installation documentation on twiki
  - https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringNagiosInstall
• Mailing list for community support of sites
  - wlcg-monitoring-discuss@cern.ch
• Will appear in upcoming gLite releases as packaged software
• Will be bundled with “follow-up” documentation to help site admins understand what went wrong on probe failure
New (early-access) volunteers welcome!
New visualizations for the Grid?
• Grid monitoring data is complex!
  - And there are many sites…
• Current tools visualize data by sorted tables, bar charts, etc.
• Difficult to present an easy-to-understand top-level view which
  - provides quick, action-oriented oversight and insight
  - helps to understand job failures and availability patterns
Can new visualizations help?
GridMap Visualization
• Idea
  - visualize the Grid by using Treemaps (Grid + Treemap = GridMap)
• Example GridMap
  [Figure: example GridMap with labelled regions and sites]
  - Size of rectangle is e.g. size of site (#CPUs), #running jobs, ...
GridMap Visualization
• Example GridMap
  [Figure: the same GridMap with colour key: ok, degraded, down]
  - Colour of rectangle is e.g. SAM status of site / service, availability of site / service, ...
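The size and colour mapping can be sketched with the simplest treemap layout, slice-and-dice: regions split the canvas side by side in proportion to their total size, and the sites of each region split that region's rectangle top to bottom. The slides do not say which layout algorithm GridMap uses, so this stands in for it. The site names, CPU counts and colour names are made up for illustration.

```python
def slice_and_dice(items, x, y, w, h, horizontal=True):
    """Split the rectangle (x, y, w, h) among (name, size) items by size."""
    total = sum(size for _, size in items)
    rects = {}
    offset = 0.0
    for name, size in items:
        frac = size / total if total else 0.0
        if horizontal:
            rects[name] = (x + offset, y, w * frac, h)
            offset += w * frac
        else:
            rects[name] = (x, y + offset, w, h * frac)
            offset += h * frac
    return rects


def grid_map(regions, width, height):
    """regions: {region: {site: size}} -> {(region, site): (x, y, w, h)}."""
    region_sizes = [(name, sum(sites.values())) for name, sites in regions.items()]
    region_rects = slice_and_dice(region_sizes, 0.0, 0.0, width, height, horizontal=True)
    layout = {}
    for name, sites in regions.items():
        rx, ry, rw, rh = region_rects[name]
        site_rects = slice_and_dice(list(sites.items()), rx, ry, rw, rh, horizontal=False)
        for site, rect in site_rects.items():
            layout[(name, site)] = rect
    return layout


# Assumed colour names for the ok / degraded / down key on the slide.
STATUS_COLOURS = {"ok": "green", "degraded": "yellow", "down": "red"}


def colour_for(status):
    """Colour of a site rectangle; unknown or missing status is left grey."""
    return STATUS_COLOURS.get(status, "grey")
```

With `regions = {"RegionA": {"SiteA1": 300, "SiteA2": 100}, "RegionB": {"SiteB1": 400}}` and an 800 by 400 canvas, each region gets half the width and SiteA1 takes three quarters of RegionA's height. Slice-and-dice tends to produce thin strips when sizes differ a lot; layouts that aim for squarer rectangles are a common alternative.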
Multiple Views
• GridMaps can be used for top-level, geographical and VO views
[Diagram labels: Global GridMap (Top-level View); Application Domain GridMap (VO Views, cross-location); Local GridMaps (Geographical Views: Federation, Partner, Site, etc.); Next level of GridMaps; Large-scale Federated Grid Services Infrastructure; Alert; Corrective action; effect]
Trends
• Trends can be understood by looking at a sequence of GridMaps
[Figure: Site Availability over time, one GridMap per day: 20 Sep 2007, 21 Sep 2007, 22 Sep 2007, 23 Sep 2007, 24 Sep 2007, 25 Sep 2007]
More Views
• Correlations of metrics can be discovered by switching between different views
• Site Availability from different VO perspectives: OPS, Alice, Atlas, CMS, LHCb
  - sites without colour do not support the VO
• Status of different Site Services: Overall Site, CE, SE, SRM, site BDII
GridMap Prototype Architecture
[Diagram labels: Grid sites; existing monitoring system(s); GridMap Server; Web Browser showing the GridMap View (Title, view1, view2, view3)]
• GridMap Server
  - provides client side code and client supporting services
  - implements GridMap Layout Algorithm
  - retrieves and caches data from existing monitoring systems
  - POC implementation is based on Apache / Python
• GridMap View
  - Browser based Web 2.0 type client component
  - single interactive and responsive web page (no page reloads required, data is retrieved in the background)
  - fast switching between views possible
  - details of the site/service statuses are shown as a context sensitive Tooltip
  - POC implementation is based on HTML, lightweight JavaScript libraries, AJAX type communication pattern
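The "retrieves and caches data" role of the server can be sketched as a small time-based cache in front of a slow data source, so that many browser requests share one upstream query. The class name, the five-minute default and the injected `fetch` callable are assumptions; the slides do not describe how the prototype caches.

```python
import time


class CachedSource:
    """Cache the result of fetch() and refresh it after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the monitoring system
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        """Return cached data, calling fetch() only when the cache is stale."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

A request handler would call `get()` on every AJAX request; only the first call in each interval reaches the underlying monitoring system.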
GridMap Prototype View Component
Link: http://gridmap.cern.ch
[Screenshot annotations:]
• Drilldown into region by clicking on the title
• Grid topology view (grouping)
• Metric selection for size of rectangles
• Metric selection for colour of rectangles
  - Show SAM status
  - Show GridView availability data
• VO selection
• Overall Site or Site Service selection
• Description of current view
• Context sensitive information
• Colour Key
GridMap Prototype: Link to Existing Tools
• Clicking on a site opens a page with details in GridView/SAM
[Screenshot labels: Site Detail, Availability, SAM Test Results]
Conclusions
To improve reliability we need to:
• Provide more information to site administrators
  - That relates to what users actually see when using their site
  - A lot of data is already gathered, so if possible don’t gather it again
  - Need to get it into the fabric monitoring system already used at a site
  - Nagios-based prototype is validating the approach, with good feedback from early adopters
• Improve the visualization
  - Too much data, especially for central monitoring (~250 sites)
  - New techniques help to compress information and bring useful information into view
http://gridmap.cern.ch
http://nagios-test.cern.ch/nagios (guest:guest)