Lemon Tutorial

Presentation Transcript


  1. Lemon Tutorial Lemon Overview Miroslav Siket, Dennis Waldron http://cern.ch/lemon CERN-IT/FIO-FD

  2. Tutorial
     • Why?
       • The number of services is expanding, so there is more to monitor every day.
     • For whom?
       • Service managers, to configure monitoring of their services
       • Developers, to simplify their life when writing sensors
       • Site managers, to set up their monitoring instances

  3. Tutorial Outline
     • Architecture
     • Writing sensors
     • Running and configuring the agent
     • Using Lemon tools
     • Running Lemon server(s)
     • Running and configuring the web interface
     • Running the alarm system

  4. Architecture

  5. Architecture II
     • Three layers:
       • Data producing/consuming
       • Data manipulation
       • Data storage

  6. Client side
     • Agent
       • forks sensors and communicates with them using a custom protocol over bi-directional pipes
       • configures metric instances of a sensor's metric classes and pulls metrics from them
       • checks the status of sensors
       • sends data to servers using TCP or UDP
       • monitors itself with the internal MSA sensor
       • caches data locally
     • The default Linux client distribution comes with the agent and the Linux and file sensors.
     • Footprint:
       • agent: 5.5MB and 0.02% CPU utilization*
       • core sensors (Linux, file, exception): 10MB and 0.2% CPU utilization*
       • parseLog: 9.4MB
     • C++ and Perl APIs are currently available.
     * i386, SLC3/4, RHES3/4; average over the CERN CC

  7. Server side
     • Two implementations:
       • Oracle based: OraMon
         • optimized for high performance and for large computer centers
         • runs on Oracle 9i+ (the alarm system requires 10g)
         • validation of metric samples, metadata information
       • Flat-file based: FlatMon (edg-fmon-server)
         • uses OS files for storing data
         • for smaller sites (scales to 1000 machines at most)
     • General features:
       • multithreaded UDP/TCP server
       • built-in authentication mechanism

  8. Server side - planning
     • Space considerations
       • OraMon: about 400kB of data per machine per day on Oracle Enterprise edition with compression, 700kB without compression (XE, Standard)
       • FlatMon: about 1.2MB per machine per day
     • CPU considerations
       • A dual PIV 3GHz machine with 4GB of memory running the Oracle DB server and OraMon requires about 15% CPU for 4000 monitored machines
       • Adding the alarm system on Oracle requires an additional 5% of CPU
       • FlatMon saturates the same machine with 1000 monitored hosts
       • OraMon/FlatMon require about 105MB of memory
     • Functionality considerations
       • FlatMon does not provide metric checks and has no metadata concept
       • The Lemon Alarm System (LAS) runs on Oracle as PL/SQL procedures, requires Oracle 10g and is integrated with the OraMon schema in the Oracle database
       • For an HA architecture, use Oracle RAC and multiple OraMon servers
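
The space figures above can be turned into a rough capacity estimate. The following is only a sketch of that arithmetic in Python: the function names and the convention 1 GB = 1,000,000 kB are assumptions made for illustration, and the per-machine volumes are the ones quoted on the slide.

```python
# Per-machine daily data volumes quoted on the slide, in kB.
KB_PER_MACHINE_DAY = {
    "oramon_compressed": 400,    # Oracle Enterprise edition with compression
    "oramon_uncompressed": 700,  # Oracle XE or Standard, no compression
    "flatmon": 1200,             # FlatMon, about 1.2MB
}


def daily_storage_gb(backend, machines):
    """Estimated repository growth per day in GB (assumes 1 GB = 10**6 kB)."""
    return KB_PER_MACHINE_DAY[backend] * machines / 1_000_000


def yearly_storage_gb(backend, machines):
    """Estimated repository growth per 365-day year in GB."""
    return daily_storage_gb(backend, machines) * 365
```

For instance, `daily_storage_gb("oramon_compressed", 4000)` corresponds to 4000 machines at 400kB each, that is about 1.6 GB per day.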

  9. User/administration tools
     • lemon-cli
       • Retrieves monitoring data from the local machine cache
       • Can also retrieve data from the server
       • Currently uses the SOAP interface (to be retired soon)
     • lemon-host-check
       • Checks the status of the machine based on the values of exceptions
       • Checks the status of the monitoring agent and sensors
       • Manages the status of exceptions

  10. Configuration management
      • At CERN we use the Quattor Configuration Database
        • Configuration is stored in hierarchical templates per domain/cluster/node
        • The NCM framework is used to download the configuration XML profile to nodes
        • NCM components are used:
          • for agent/sensor configuration: the fmonagent component
          • for server configuration (metadata): the oramonserver component
      • For smaller sites with homogeneous structures
        • Use the default agent and sensor rpms from Lemon
        • Use rpms for custom sensors/settings

  11. Lemon RRD framework
      • User front-end for visualizing and caching monitoring data
      • Two layers
        • Pre-processing (lemonmrd): consumes monitoring data and creates rrd files per machine/cluster/… (aging, averages)
        • Visualization: status web pages that use the rrd files for fast visualization, or access the monitoring repository directly
      • Different plugins/options available:
        • Synoptic display of the Computer Center (XML driven)
        • Lemon Alarm GUI
        • Quattor .tpl file browser, …
      • Requirements
        • Web server with PHP (v5+ if you want to use LAS)
        • rrdtool rpm
        • 500kB of space per machine's rrd file

  12. Automatic recovery actions and alarms
      • Sensor exception
        • For defined values of measured metrics, an actuator is called with a predefined action
        • An example: ssh daemon dead, action /sbin/service sshd start
        • Definition: metric X, field Y <op> reference value Z => call actuator
        • <op> can be ==, <, >, regexp, range, +, -, *, / etc.
        • Each occurrence is logged in the Monitoring Repository
        • About 230 exceptions with automatic recovery actions are already predefined
      • Exceptions are the basis for alarms in the Lemon Alarm System
        • Allow multi-valued metrics and on-behalf metrics
        • Allow corrective actions (actuators) up to n times or within a given time window
        • Allow the alarm state to be distinguished (failed actuator, silenced, …)
      • Example:
        • (10004:7 > 100 && (10005:3 - 34:5)>100:56)
        • On behalf: (soap_srvx:302:1 > 10)

  13. Lemon Alarm System
      • Newest addition to Lemon
      • Built on top of the OraMon schema in the Oracle database
      • Comes in two pieces:
        • PL/SQL stored procedures (require Oracle 10g) that consume exceptions and produce alarms
        • GUI: an AJAX-based web interface, part of LRF
      • Features
        • Reduction of alarms (by type or by node/cluster)
        • Possibility to hide/inhibit alarms
        • Access control
        • History tracking
        • Future: notifications, RSS feeds

  14. Software distribution
      • RPM
        • direct download from http://lemon.web.cern.ch/lemon/downloads.shtml or http://linuxsoft.cern.ch/lemon/
        • YUM setup with /etc/yum.repos.d/lemon.repo:
            [lemon]
            name=Lemon
            baseurl=http://linuxsoft.cern.ch/lemon/linux/RPMS/i386/sl4/stable/
            enabled=1
            gpgcheck=1
            gpgkey=http://linuxsoft/lemon/RPM-GPG-KEY-lemon
        • APT setup with /etc/apt/sources.list.d/lemon.list:
            # Lemon stable
            rpm http://linuxsoft.cern.ch/lemon linux/RPMS/i386/sl4 lemon_stable_sl4
      • Source code
        • CVS: CVSROOT=:pserver:anonymous@isscvs.cern.ch:/local/reps/elfms

  15. Future and additional information
      • Things not covered/under development
        • XML gateway with an API for several languages (C++, Perl, Python, Java, …)
        • Python sensor API
        • LAS notifications, RSS feeds
        • Encryption of data between agent and server
        • Authentication for user access
        • Service views for LRF
      • For additional information, check the web pages: http://cern.ch/lemon

  16. Lemon Tutorial: Sensor How-To

  17. Outline
      • Terminology
      • Examples of existing sensors
      • Considerations
      • Live examples
        • Hello World
        • Service-based monitoring
      • Do's and Don'ts

  18. Terminology
      • Sensor:
        • A process or script that is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent. Sensors implement metric classes.
      • Metric Class:
        • The equivalent of a class in OOP (Object-Oriented Programming).
      • Metric Instance:
        • An instance (an object) of a metric class, with its own configuration data.
      • Metric ID:
        • A unique identifier associated with a particular metric instance of a particular metric class.

  19. Existing sensors
      • At CERN:
        • Approx. 40 active sensors are defined, providing 264 metrics and 227 exceptions.
      • The default installation of the Lemon agent comes with three sensors:
        • MSA (built in): self-monitoring of the agent.
        • Linux: performance, file system and process monitoring.
        • File: file tests, e.g. size, mtime, ctime.
        • Together they provide 135 metrics (51% of all CERN metrics).
      • Other officially distributed sensors include:
        • exception: correlation sensor for generating alarms.
        • remote: provides ping and http web server checks.
        • oracle: Oracle database statistics monitoring.
        • parselog: log file parsing sensor.
      • All are available from the Lemon software repository: http://linuxsoft.cern.ch/lemon/
      • Other contributed sensors are available from CVS: CVSROOT=:pserver:anonymous@isscvs.cern.ch:/local/reps/elfms/sensors

  20. Considerations
      • Question: What is your goal? How do you intend to use the monitoring information you collect?
      • Is it for:
        • Pure data collection?
          • OK
        • Graphs displayed on the Lemon status pages?
          • Collecting data does not give you graphs immediately. This is not automatic!
        • Information to be alarmed on?
          • Make sure the structure of the data you collect can be alarmed on!
      • Data that cannot be alarmed on:
        • Timestamps as strings: NO
        • Timestamps as numbers: NO
        • Complex strings that need parsing: NO

  21. Considerations (II) - Use Case
      • Grid certificate expiry
        • Outline: you wish to be notified, or to raise an alarm, if the Grid certificate on a machine will expire in the next two weeks.
      • You need 1 metric and 1 exception
        • The metric records the expiry time of the certificate.
        • The exception checks the metric and decides whether the certificate expires in the next two weeks.
        • The metric needs to be structured in such a way that the correlation unit of the exception sensor can understand it.
      • Can I record the data as a:
        • String, e.g. "Sun Oct 8 16:05:47 2006"? NO (cannot be converted to a number)
        • Unix time, e.g. "1160316347"? NO (the correlation unit doesn't understand time, yet!)
      • Solution:
        • Record the number of seconds until the certificate expires.
        • E.g. 1814400 seconds (3 weeks) can be alarmed on arithmetically: if metric < 1209600 (2 weeks) then raise an alarm.
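
The certificate use case above boils down to recording a plain number and comparing it with a threshold. The Python sketch below illustrates that idea only; it is not written against the Lemon sensor API, and the function names are hypothetical. The two constants come from the slide (1209600 seconds is two weeks).

```python
TWO_WEEKS = 14 * 24 * 3600  # 1209600 seconds, the threshold used on the slide


def seconds_until_expiry(expiry_unix, now_unix):
    """Value the metric would record: seconds left before the certificate expires.

    The result is negative once the certificate has already expired.
    """
    return int(expiry_unix) - int(now_unix)


def expires_soon(seconds_left, threshold=TWO_WEEKS):
    """Mirror of the exception check 'metric < 1209600'."""
    return seconds_left < threshold
```

Recording the remaining seconds rather than the expiry timestamp keeps the exception a simple numeric comparison, which is what the correlation unit can evaluate.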

  22. Considerations (III)
      • Misconception:
        • That in Lemon a metric has to be related to one and only one distinct piece of information (1 to 1 mapping).
      • Not true:
        • A metric can be associated with multiple values and can have multiple rows, with each row identified by a unique key.

  23. Considerations (IV) - Use Case
      • Recording partition information
        • Outline: you would like to know the total size, the space used in megabytes, the space used as a percentage and the mount options of all mounted partitions on a machine.
        • Under the idea of a 1 to 1 mapping, that's 4 metrics per partition. An average machine may have 7 partitions (4x7 = 28 metrics in total).
      • Why not convert the data into a multi-valued metric?
        • 7 metrics, each reporting 4 values. So,
            Metric 1 total_space
            Metric 2 space_used_mb
            Metric 3 space_used_perc
            Metric 4 mount_options
          becomes:
            Metric A total_space space_used_mb space_used_perc mount_options
      • Go one step further: convert the data into a multi-valued, multi-rowed metric
        • 1 metric reporting the values for all mount points. So,
            Metric A total_space space_used_mb space_used_perc mount_options
          becomes:
            Metric B mountname1 total_space space_used_mb space_used_perc mount_options
            Metric B mountname2 total_space space_used_mb space_used_perc mount_options
            …
      • Benefits:
        • Monitoring of new mount points is dynamic: no need for reconfiguration, and no need to go through a registration process to get new metric ids.
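
To make the multi-valued, multi-rowed layout above concrete, here is a small Python sketch that builds one row per mount point, with the mount name as the row key followed by the four values. The input dictionary, its key names and the function names are hypothetical; how a real sensor gathers partition data and hands rows to the agent is not shown.

```python
def partition_rows(partitions):
    """Return one row per mount point: key first, then the four values.

    `partitions` maps a mount name to a dict with the (hypothetical) keys
    'total_mb', 'used_mb' and 'options'.
    """
    rows = []
    for mount, info in sorted(partitions.items()):
        total = info["total_mb"]
        used = info["used_mb"]
        used_perc = round(100.0 * used / total, 1) if total else 0.0
        rows.append((mount, total, used, used_perc, info["options"]))
    return rows


def format_rows(rows):
    """Render each row as a space-separated line, key first."""
    return [" ".join(str(value) for value in row) for row in rows]
```

A newly mounted partition simply shows up as one more row of the same metric, which is the dynamic behaviour the slide lists as a benefit.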

  24. Example 1 - Hello World
      • Objective: to create a Perl sensor which records the value "Hello World" into Lemon.
      • A simple sensor to demonstrate:
        • The generic build framework for sensors.
        • How to register your Perl module with the API.
        • How to register the metric classes that your module provides.
        • How to store the text "Hello World" into Lemon for the machine on which the sensor runs.
        • Running and debugging your sensor on the command line.
      • Functions used:
        • registerVersion()
        • registerMetric()
        • storeSample01()
      • Documented at: http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml

  25. Example 2 - Service Monitoring
      • Objective: to check whether a web page is available on a remote web server and record the HTTP response code under a service name.
      • Demonstrates:
        • The basics of on-behalf reporting
        • The ability to parse configuration arguments
        • The ability to log messages
      • Functions used:
        • registerMetric()
        • getParam()
        • log()
        • storeSample03()

  26. Do's and Don'ts
      • Don't:
        • Call die() or exit() from inside your sensor.
        • Open or write to files in locations writeable by non-root users, such as /tmp/.
        • Read from filehandles (e.g. sockets) that may block. This will make your sensor unresponsive to requests from the agent.
        • Rely on, or have dependencies on, files on remote file systems such as AFS (Andrew File System). Your sensor should aim to have as few dependencies as possible.
      • Do:
        • Document your sensor. Refer to the sensor tutorial to see how this can be done automatically for you.
        • Use a timeout around calls to databases and services like LSF whenever you have the ability to do so.
        • Make your metric classes configurable, and avoid hard-coded paths to non-standard files.
        • Try to make your sensors as generic as possible so that others can benefit from your work.
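
The advice about timeouts can be illustrated with a short sketch. It is written in Python rather than against the Perl sensor API, and the helper name and the two stand-in commands are hypothetical; the point is only that a slow external call is abandoned after a bounded time instead of blocking the sensor.

```python
import subprocess
import sys

# Stand-ins for external service calls (hypothetical): one slow, one fast.
SLOW_CALL = [sys.executable, "-c", "import time; time.sleep(5)"]
FAST_CALL = [sys.executable, "-c", "print('ok')"]


def run_with_timeout(argv, timeout_s):
    """Run an external command and return its stdout.

    Returns None if the command does not finish within timeout_s seconds;
    subprocess.run() kills the child process in that case.
    """
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

A sensor built this way can report "no value" for a sample when the service is slow, and stay responsive to the agent.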

  27. Lemon Tutorial: Sensor Exception

  28. Outline
      • What is it?
      • Configuration
      • Correlation examples
      • Actuators
      • Dealing with transient alarms

  29. What is it?
      • Sensor-exception
        • An officially supported Lemon sensor coded in C++.
        • Developed in collaboration between CERN and BARC.
        • Implements the Lemon alarm protocol.
        • Has a LEX & YACC correlation engine which allows it to evaluate one or more metrics to determine whether a problem exists on a machine.
        • Supports reporting alarms on behalf of other monitored entities.
        • Allows corrective actions (actuators) up to n times or within a given time window.
        • Is the primary interface for inserting alarms into the Lemon framework. The output of the sensor is used by LAS and lemon-host-check.
        • Provides one and only one metric class, "alarm.exception".
      • Full documentation at: http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml

  30. Configuration
      • The sensor has 6 configuration options:
        • Correlation
          • The power behind the exception sensor's capabilities.
          • Tells the sensor which metrics are involved in the alarm and how they should be evaluated.
        • Actuator
          • The path to an actuator to run if the correlation string is true.
        • MaxRuns
          • The maximum number of times an actuator can run consecutively before a final alarm is generated.
        • Timeout
          • The maximum number of seconds that an actuator is allowed to run before being terminated by the sensor.
        • MinOccurs
          • The minimum number of consecutive times a problem must be present before an alarm is raised.
          • Good for dealing with transient alarms.
        • Silent
          • Defines whether the exception should run in silent mode. A silent exception continues to be evaluated, but the result is not displayed on LAS or lemon-host-check.
          • Good for testing and deployment of new alarms.

  31. Configuration (II)
      • The basic format of a correlation is:
          [entity_name]:<metric_id>:<field_position> <operator> <reference_value> ...
      • Where:
        • entity_name
          • An optional parameter, used for reporting on behalf of other entities
          • The name of the entity (wildcards '*' are supported)
        • metric_id
          • The id of the metric to check
        • field_position
          • The field to use within the metric
          • Allows the correlation to extract a single value from a multi-valued metric
        • operator
          • E.g. ==, !=, >, <, eq, ne, regex, !regex …
        • reference_value
          • A string or number against which metric_id:field_position is compared
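
The left-hand side of a correlation term, `[entity_name]:<metric_id>:<field_position>`, can be picked apart with a few lines of Python. This is only a sketch of the textual format described above, with a hypothetical function name; it is not the LEX & YACC engine of the exception sensor, and it assumes that the metric id and field position are plain decimal integers.

```python
import re

# [entity_name]:<metric_id>:<field_position>; the entity name is optional.
_METRIC_REF = re.compile(
    r"^(?:(?P<entity>[^:\s]+):)?(?P<metric>\d+):(?P<field>\d+)$"
)


def parse_metric_ref(text):
    """Split a metric reference into (entity_name, metric_id, field_position).

    entity_name is None when the reference has no on-behalf entity.
    """
    match = _METRIC_REF.match(text.strip())
    if match is None:
        raise ValueError("not a metric reference: %r" % text)
    return (
        match.group("entity"),
        int(match.group("metric")),
        int(match.group("field")),
    )
```

For example, `parse_metric_ref("10004:7")` has no entity part, while `parse_metric_ref("soap_srvx:302:1")` carries the entity name `soap_srvx` used for on-behalf reporting.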

  32. Correlation Example (I)
      • Objective:
        • To run an actuator when the occupancy of the /tmp partition is greater than 80%.
      • Involved metrics
        • 9104 (system.partitionInfo)
        • Field 1 = mountname, field 5 = percentage occupancy
      • Configuration
          Correlation ((9104:1 eq '/tmp') && (9104:5 > 80))
          Actuator /usr/local/sbin/clean-tmp-partition -o 75
          MaxRuns 3 900
          Timeout 300
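
The meaning of the correlation `((9104:1 eq '/tmp') && (9104:5 > 80))` can be spelled out in Python. The sketch below assumes that both conditions are applied to the same row of the multi-rowed metric, which the slide does not state explicitly; the function name and the row representation (a tuple whose first element is field 1) are hypothetical.

```python
def partition_over_limit(rows, mount="/tmp", limit=80):
    """True if the row whose field 1 equals `mount` has field 5 above `limit`.

    Field positions are 1-based on the slide, so field 1 is row[0]
    and field 5 is row[4].
    """
    for row in rows:
        if row[0] == mount and float(row[4]) > limit:
            return True
    return False
```

Note that the comparison is strict: a partition at exactly 80% occupancy does not satisfy `> 80`.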

  33. Correlation Example (II)
      • Objective:
        • To raise the alarm "lemon_agent_wrong" if the memory utilisation, the CPU utilisation or the number of errors in the agent's log file is not within acceptable limits.
      • Correlation
          10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
      • In words: raise an alarm if the uptime of the agent (10004:1) is greater than 600 seconds AND at least one of the following holds:
        • the CPU utilisation of the sensors (10004:7) over the last sampling interval is greater than 10%
        • the memory consumed by the sensors (10004:8) is greater than 150 megabytes on machines of architecture type (4109:3) i386, or greater than 600 megabytes on machines of architecture type x86_64
        • the number of warning messages (10007:2) recorded over the last sampling interval is greater than 50
        • the number of error messages (10007:3) recorded over the last sampling interval is greater than 10
        • the number of fatal messages (10007:4) recorded over the last sampling interval is greater than 0

  34. Actuators
      • Information:
        • Run as forked processes.
        • Are connected to the sensor via a pipe.
        • All information written to stdout or stderr by the actuator is caught and recorded in the agent's log file.
        • All actuator attempts are logged centrally and recorded locally in the agent's log file.
      • Running shell-style actuators:
        • The system call used to run an actuator doesn't provide shell-style conveniences.
        • To use shell-style syntax like *, &&, | etc. you must define your actuator like this:
            Actuator /bin/sh -c \\" /bin/echo 'This is a demo message from $HOSTNAME' \\"
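
The difference between running a program directly and running it through a shell, which is why the `/bin/sh -c` wrapper above is needed, can be demonstrated in Python. This sketch assumes a POSIX system with `/bin/sh` and `/bin/echo`; the function names and the variable `DEMO_HOST` are hypothetical, and it does not reproduce how the exception sensor itself launches actuators.

```python
import subprocess


def run_direct(argv, env):
    """Execute the program directly: no shell, so '$NAME' is passed literally."""
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.stdout


def run_via_shell(command, env):
    """Wrap the command in /bin/sh -c so that shell expansion takes place."""
    result = subprocess.run(
        ["/bin/sh", "-c", command], env=env, capture_output=True, text=True
    )
    return result.stdout
```

With `env = {"DEMO_HOST": "node1"}`, the direct call `run_direct(["/bin/echo", "from $DEMO_HOST"], env)` echoes the text unexpanded, whereas `run_via_shell("/bin/echo from $DEMO_HOST", env)` lets the shell substitute the variable first.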

  35. Dealing with transient alarms
      • Why do we get transient alarms?
        • By default, monitoring isn't very tolerant of outside interventions.
        • There may be network issues.
        • A resource may be temporarily unavailable.
      • What can be done?
        • Use the configuration option MinOccurs.
        • MinOccurs gives an exception a level of tolerance: a delay between detecting a problem and raising an alarm.
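
The effect of MinOccurs can be sketched as a counter of consecutive positive evaluations. The class below is a hypothetical Python illustration of that behaviour as described on the slides ("the minimum number of consecutive times a problem must be present before raising an alarm"); it is not the exception sensor's implementation.

```python
class MinOccursFilter:
    """Suppress an alarm until a problem has been seen N times in a row."""

    def __init__(self, min_occurs):
        self.min_occurs = min_occurs
        self.consecutive = 0

    def update(self, problem_present):
        """Feed one evaluation result; return True when an alarm should be raised."""
        if problem_present:
            self.consecutive += 1
        else:
            # Any clean evaluation resets the streak, which filters out blips.
            self.consecutive = 0
        return self.consecutive >= self.min_occurs
```

With `MinOccursFilter(3)`, a problem that appears for two sampling intervals and then clears never raises an alarm; only a problem present for three consecutive evaluations does.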

  36. Lemon Tutorial: Quattor and Non-Quattor Configuration of the lemon-agent

  37. Outline
      • What is the agent?
      • How to install the agent
      • Configuring the agent
      • Demonstration

  38. What is the agent?
      • A daemon on every monitored machine that is responsible for:
        • Launching sensors, scheduling requests to them and communicating with them.
        • Checking the status of sensors.
        • Sending sensor information to the central Lemon servers using TCP and/or UDP.
        • Monitoring itself with the internal MSA sensor.
        • Caching data locally for use by other Lemon tools, e.g. lemon-host-check and lemon-cli.
      • Full documentation at: http://lemon.web.cern.ch/lemon/docs.shtml

  39. Configuring the agent
      • Two supported ways:
        • Quattor
          • Configuration is stored in hierarchical templates per domain/cluster/node.
          • The NCM framework is used to download the configuration XML profile to nodes.
          • NCM components are used to convert the XML profile information into the agent's native configuration file structure.
          • Documented at: http://cern.ch/lemon/doc/howto/lemon_cdb_howto.shtml
        • Non-Quattor
          • Best suited for homogeneous sites.
          • Use the default agent and sensor rpms from Lemon.
          • Use rpms for custom sensors/settings.
          • The agent supports a modular style of configuration, where configuration files are placed into subdirectories depending on their purpose:
              /etc/lemon/agent/metrics/     <- metric configuration
              /etc/lemon/agent/sensors/     <- sensor configuration
              /etc/lemon/agent/transports/  <- transport configuration
      • The Quattor and Non-Quattor styles of configuration can live together on the same machine.

  40. Demonstration
      • Installation of the agent and default sensors
          rpm -Uvh edg-fabricMonitoring-agent-2.13.0-2.i386.rpm
          rpm -Uvh lemon-sensor-exception-1.2.1-2.i386.rpm
      • Configuration of:
        • General agent settings (/etc/lemon/agent/general.conf)
        • Servers (transports) (/etc/lemon/agent/udp.conf)
      • Defining a new sensor
      • Defining a new metric

  41. Lemon Tutorial: lemon-host-check

  42. Outline
      • What is it?
      • Demonstration

  43. What is it?
      • lemon-host-check is:
        • The latest Lemon tool.
        • A tool for checking the current status of all configured exceptions on the machine.
        • A tool for managing the state of exceptions, with the ability to turn exceptions off and on, on the fly, without reconfiguring the agent.
        • The first command you should run whenever you believe monitoring is incorrect!
      • It works by instructing the local agent to refresh all metrics contributing towards exceptions (raw metrics) and then requesting a refresh of all exceptions.
      • It uses fresh monitoring data.
      • Fully documented at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml

  44. Demonstration
      • Installation of lemon-host-check
          rpm -Uvh lemon-host-check-1.0.1-7.noarch.rpm
          rpm -Uvh edg-fabricMonitoring-mrs-1.0.8-1.i386.rpm
      • Show how to:
        • Interpret the information returned by lemon-host-check
        • Enable and disable exceptions
        • View pre-alarms, running actuators and disabled metrics

  45. Lemon Tutorial: FlatMon and OraMon servers

  46. Authentication
      • Available for both the flat-file (FlatMon) and Oracle-based (OraMon) servers
      • Used for both TCP and UDP (stateless connections)
      • Uses the OpenSSL libraries with public key methods to authenticate: sign() and verify() methods
      • Supports the RSA/SHA1, MD5 and DSA/DSS1 algorithms with different key sizes (default = 1024 bit)
        • Fastest: RSA/SHA1
        • X509 would add too much overhead
      • Three levels:
        • 0: no authentication
        • 1: authentication of signed packets; non-signed packets are also accepted
        • 2: full enforcement of authentication
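
The three authentication levels amount to a small acceptance policy, sketched below in Python. The function name and arguments are hypothetical, and one point is an assumption not stated on the slide: a packet that carries a signature which fails verification is rejected at levels 1 and 2. The cryptographic verification itself (OpenSSL sign()/verify()) is outside the scope of this sketch and is represented by the boolean `signature_valid`.

```python
def accept_packet(level, signed, signature_valid=False):
    """Decide whether the server accepts a packet at the given auth level.

    level 0: no authentication, everything is accepted.
    level 1: signed packets must verify; unsigned packets are still accepted.
    level 2: only signed packets that verify are accepted.
    """
    if level not in (0, 1, 2):
        raise ValueError("unknown authentication level: %r" % (level,))
    if level == 0:
        return True
    if signed:
        # Assumption: a bad signature is rejected at both level 1 and level 2.
        return signature_valid
    return level == 1
```

Level 1 is the natural migration setting: agents that already have keys are verified, while agents that have not yet been given keys keep reporting.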

  47. Authentication - schema
      • Nodes (agents):
        • Node1: [rsa_encrypt(s.pub_key)] rsa_sign(n1.sec_key)
        • Node2: [rsa_encrypt(s.pub_key)] rsa_sign(n2.sec_key)
        • Node3: [rsa_encrypt(s.pub_key)] rsa_sign(n3.sec_key)
      • Servers:
        • Server1: rsa_verify(metric,n3.pub_key) [rsa_decrypt(s.sec_key)]
        • Server2: rsa_verify(metric,n1.pub_key) [rsa_decrypt(s.sec_key)]
      • Legend:
        • s.pub_key: server's public key
        • n(x).sec_key: agent's secret key
        • n(x).pub_key: agent's public key

  48. Setup of FlatMon
      • Quick overview:
        • Install the server rpm
        • Set up the /etc/lemon/server/edg-fmon-server.conf file
        • Set up the /etc/lemon/server/keys directory with client keys
        • Check authentication
        • Check that data is arriving at the server
        • Check the log files for problems

  49. Setup of OraMon
      • Quick overview:
        • Install the rpms (lemon-ora-admin, lemon-OraMon)
        • Have the DBA create the schema (use an adapted lemon_user.sql)
        • Set up the schema for OraMon with lemon-ora.admin
        • Configure the metadata information (/etc/oramon-server.conf)
        • Configure OraMon:
          • System settings with /etc/sysconfig/OraMon
          • Access settings with /etc/lemon/server/lemon-oramon-server.conf
        • Check the log file for problems
        • Check data with lemon-ora.retrieve
        • Change the metadata
      • Documentation:
        • http://lemon.web.cern.ch/lemon/doc/components/lemon-ora-admin/index.html
        • http://lemon.web.cern.ch/lemon/doc/components/oramon/index.html

  50. Lemon Tutorial: LRF