LHCb PC Farm Monitoring and Control System

LHCb PC Farm Monitoring and Control System. Domenico Galli, Bologna. JCOP Project Team Meeting, Genève, 28 April 2005. Outline: overview, aim of the system, software framework, software components and their deployment, releases; guidelines followed in software development; main components.


Presentation Transcript


  1. LHCb PC Farm Monitoring and Control System • Domenico Galli, Bologna • JCOP Project Team Meeting, Genève, 28 April 2005

  2. Outline • Overview, aim of the system, software framework, software components and their deployment, releases. • Guidelines followed in software development. • Main Components: • Task Manager • Monitoring System • IPMI Power Manager • Utilities: • Logger • Process Controller

  3. Overview • The LHCb PC Farm Monitoring and Control System has been designed mainly for the LHCb L1/HLT event filter farm. • The system has been developed for Linux PCs but will be ported to MS Windows so that it can also be used on the PCs involved in LHCb detector monitoring and control. • It uses DIM (Distributed Information Management System) as its communication layer and is accessible through both a command line interface and a PVSS graphical interface.

  4. Aim of the System • Monitoring • Display of relevant parameters concerning the status of the farm. • Transition of a state machine to an alarm state when the monitored parameters indicate error/warning conditions. • Control • Action execution (system reboot, process start/stop, etc.) triggered by a manual command or by a state machine transition.

  5. PVSS (Prozessvisualisierungs- und Steuerungs-System) • To build an interface to the farm monitoring system that is coherent with the monitoring of the detector hardware, we make use of the PVSS SCADA (Supervisory Control and Data Acquisition) tool. • PVSS provides: • a runtime DB and automatic archiving of data to permanent storage; • alarm generation; • easy realization of graphical panels; • various protocols to communicate via the network.

  6. Sensors and Actuators • PVSS needs to be interfaced with the farm nodes: • to receive monitor data; • to issue commands to the nodes. • On each node a few light processes (light-weight servers) run: • monitor sensors; • command actuators. • The PVSS-to-nodes interface is achieved using the DIM light-weight network communication layer.

  7. DIM (Distributed Information Management System) • [Diagram labels: PVSS, sensor, actuator, farm node] • The DIM network communication layer is already integrated with PVSS: • It is light-weight and efficient. • It allows bi-directional communication. • It uses a name server for service/command publication and subscription.

  8. Monitoring and Control System Components • Light-weight servers (to be installed on each farm node): • Task Manager Server • 9 Monitor Servers • Logger Server • Control software to be installed on each control PC: • Power Manager (IPMI) Server • Process Controller • Command line clients (can run on any node on the network): • 3 Task Manager Clients • 2 Power Manager Clients • 1 Logger Client • PVSS clients (can run on any node on the network): • Task Manager & Logger Panel. • 9 Monitor Panels. • Power Manager Panel.

  9. Monitoring and Control System Component Deployment • [Deployment diagram. Recoverable labels: Sub-Farm 1 Control PC, Sub-Farm 2 Control PC, Sub-Farm node, DIM-DNS, DIM server, sensor, actuator, IPMI BMC (firmware), IPMI DIM Server, PVSS-FSM, PVSS-DIM, PVSS EVM, PVSS Dist, PVSS DBM, PVSS CTRL, PVSS ARCH, PVSS Remote UI, Monitor Console PC, Global Control PC.]

  10. Releases • 0.1 – 22 November 2004 • 0.2 – 13 January 2005 • New Task Manager features • Process Controller • 1.0 – before June 2005 • IPMI Power Manager. • New PVSS panels. • PVSS panels to set the threshold for Finite State Machine. • PVSS trend plots and archiving.

  11. Sensor Access to System Data • In Linux kernel 2.4 procfs is the only file system-like interface to internal data structures in the kernel (the other interface is ioctl()/sysctl()). • In Linux kernel 2.6 sysfs has been added to show all of the devices (virtual and real) and their inter-connectedness within the system. • In forthcoming kernel versions: • procfs will contain only process statistics. • sysfs will contain device statistics (network interfaces, TCP/IP stack, temperatures and fan speeds, SCSI interfaces, etc.).

  12. Sensor Access to System Data (II) • At present, in kernel 2.6: • network interfaces are on both procfs and sysfs; • temperatures and fan speeds are on sysfs; • TCP/IP stack, CPU states and memory are on procfs. • Monitor sensors at present use procfs (except the temperature sensor, which uses sysfs in the version for kernel 2.6). • In the future most of the sensors (all but the process sensor) will probably have to be modified to access sysfs.

  13. Sensor Access to System Data (III) • In MS Windows the interface to internal data structures in the kernel is different. • A procfs interface is provided by cygwin, but it is rather poor. • We are investigating the use of the WMI API (Windows Management Instrumentation, .NET platform) to access internal data structures in the kernel for monitoring.

  14. Guidelines for Light-Weight Server Development • Guidelines followed in sensor and actuator development: • Functions are written in plain C with particular control of memory allocation (e.g., if possible, memory is allocated once and for all during sensor initialization). • Low-level access (not stream access) to procfs and sysfs, and one-shot data reads. • Each time a read operation (even a partial one) is performed on procfs/sysfs, the kernel takes time to produce the entire set of data provided in the file. • When possible, complex tasks use maintained libraries (like libprocps) to cope with changes in kernel version.
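The "allocate once, read in one shot" guideline can be illustrated with a short sketch. The servers themselves are written in plain C; Python is used here only for brevity, and the names `OneShotReader` and `BUF_SIZE` are hypothetical, not taken from the presentation. The sketch assumes the sampled file fits in the preallocated buffer.

```python
import os

BUF_SIZE = 65536  # assumed upper bound on the size of the sampled file


class OneShotReader:
    """Keep one file descriptor and one buffer for the lifetime of a sensor."""

    def __init__(self, path, size=BUF_SIZE):
        # Low-level descriptor: no buffered stream layer in between.
        self.fd = os.open(path, os.O_RDONLY)
        # Allocated once, at sensor initialization, and reused for every sample.
        self.buf = bytearray(size)

    def sample(self):
        """Rewind the file and fetch its whole content with a single read call."""
        os.lseek(self.fd, 0, os.SEEK_SET)
        n = os.readv(self.fd, [self.buf])  # one system call fills the buffer
        return bytes(self.buf[:n])

    def close(self):
        os.close(self.fd)
```

A sensor would typically create one reader per file (for example `/proc/stat`) at start-up and call `sample()` at each refresh, so that the kernel generates the file content only once per refresh.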

  15. Task Manager • It is a tool to start, list and stop processes on every farm node from a central console. • It uses TCP as transport protocol, through the DIM network communication layer. • To track processes (e.g. to list or to stop them) it sets an additional environment variable (a descriptive string, named UTGID, User Assigned Thread Group Identifier) in every started process. • No more than one process can be started with the same UTGID. • This way, the stamp attached to the processes survives an accidental Task Manager crash. • The UTGID can be defined by the user or can be automatically generated as <binary executable image>_<instance #>
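The automatic UTGID scheme can be sketched as follows. This is only an illustration: the function names are hypothetical, and the choice of starting the instance number at 0 and taking the lowest free value is an assumption, since the slide gives only the `<binary executable image>_<instance #>` form.

```python
import os


def make_utgid(executable_path, utgids_in_use):
    """Build '<image>_<instance #>' with the lowest instance number not in use.

    Assumption: numbering starts at 0; the presentation does not say.
    """
    image = os.path.basename(executable_path)
    instance = 0
    while f"{image}_{instance}" in utgids_in_use:
        instance += 1
    return f"{image}_{instance}"


def child_environment(utgid, extra=None, clean=True):
    """Environment for a started process: clean or inherited, plus UTGID."""
    env = {} if clean else dict(os.environ)
    env.update(extra or {})
    env["UTGID"] = utgid  # the stamp used later to list or stop the process
    return env
```

Because the stamp lives in the environment of the started process rather than in the memory of the server, it can in principle be recovered after a server restart (on Linux, for instance, from `/proc/<pid>/environ`).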

  16. Task Manager Components • tmSrv: Task Manager Server (to be executed on each farm node). • tmStart, tmLs, tmKill, tmStop: Task Manager command-line clients (can be executed on any PC on the network). • tableViewUTGID.pnl, startPanel.pnl, killPanel.pnl, stopPanel.pnl: PVSS GUI clients (can be executed on any PC on the network with a PVSS remote UI).

  17. Task Manager Features • The started process can have a clean environment (except UTGID) or can inherit the Task Manager environment. • An arbitrary number of new environment variables can be added to started processes. • The stdout and stderr of started processes can be discarded to /dev/null or can be redirected to the logger. • It can start processes as daemons (process group leader, i.e. new session ID, umask reset, no controlling tty, ignored SIGCLD). • It can set the scheduler of the started processes (time sharing, fifo or round-robin).

  18. Task Manager Features (II) • It can set the nice level (for the evaluation of the dynamic priority) of the processes started with the time sharing scheduler. • It can set the static (real-time) priority of the processes started with the fifo and round-robin schedulers. • It can set the username of the started processes. • It is immediately signaled about the termination of a started process (through the SIGCHLD signal) and: • It logs either the exit code or the number of the signal which caused the process to terminate; • It immediately refreshes the list of started processes (published using DIM).

  19. Task Manager Features (III) • The kill DIM CMD can send a chosen signal to a single process or to all the processes whose UTGID matches a certain POSIX.2 wildcard pattern. • The stop DIM CMD sends a chosen signal to a process and returns immediately, but it also schedules the deferred sending of a SIGKILL signal to the same process (in a different thread). • This way the process is given the chance to exit gracefully on reception of a SIGTERM but, if it fails to do so, it is stopped abruptly by a SIGKILL after a chosen delay.
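The stop behaviour described above (signal now, SIGKILL later from another thread, no blocking of the caller) can be sketched as follows. This is a minimal illustration and not the Task Manager code: the function name `stop`, its parameters and the use of a `subprocess.Popen` handle are assumptions, whereas the real server addresses processes by UTGID through DIM commands.

```python
import signal
import threading


def stop(proc, sig=signal.SIGTERM, delay=5.0):
    """Send `sig` to `proc` now and schedule a SIGKILL after `delay` seconds.

    `proc` is assumed to be a subprocess.Popen object. The call returns at
    once; the deferred kill runs in a separate timer thread.
    """
    proc.send_signal(sig)

    def kill_if_alive():
        if proc.poll() is None:  # still running after the grace period
            proc.kill()          # SIGKILL on POSIX systems

    timer = threading.Timer(delay, kill_if_alive)
    timer.daemon = True          # do not keep the caller alive just for the timer
    timer.start()
    return timer
```

A process that handles SIGTERM and exits within `delay` seconds never receives the SIGKILL; one that ignores SIGTERM is terminated when the timer fires.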

  20. Task Manager Command Line Interface • tmStart [-m hostname_pattern] [-c] [-D NAME=value...] [-d] [-s scheduler] [-p nice_level] [-r rt_priority] [-n user_name] [-u utgid] [-o] [-e] [-w wd] path [arg...] • Starts a new process on one or more farm PCs. • tmLs [-m hostname_pattern] [utgid_pattern] • Lists the processes which have the UTGID set. • tmKill [-m hostname_pattern] [-s sig] utgid_pattern • Sends a signal to the process(es). • tmStop [-m hostname_pattern] [-s sig] [-d delay] utgid_pattern • Sends a signal to the process(es). • If the process(es) are still alive after delay seconds, sends a SIGKILL signal, without blocking the client. • Hostname and UTGID arguments accept POSIX.2 wildcard patterns (*, ?, character classes [027], ranges [3-7], complementation [!027] or [!3-7]).
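The wildcard forms listed above can be tried out with Python's `fnmatch` module, which implements shell-style matching with the same `*`, `?`, `[...]` and `[!...]` syntax. It is used here only as a stand-in: it is not guaranteed to agree with the POSIX.2 matching used by the clients in every corner case, and the node names below are invented for the example.

```python
from fnmatch import fnmatchcase


def select(names, pattern):
    """Return the names matching a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical farm node names, for illustration only.
NODES = ["farm0101", "farm0102", "farm0107", "farm0203"]
```

For example, `select(NODES, "farm01*")` keeps the three nodes whose names start with `farm01`, `select(NODES, "farm010[!27]")` keeps only `farm0101`, and `select(NODES, "farm0?0[3-7]")` keeps `farm0107` and `farm0203`.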

  21. Task Manager PVSS interface

  22. Task Manager PVSS interface (II)

  23. Task Manager PVSS interface (III)

  24. Task Manager PVSS interface (IV)

  25. Task Manager PVSS interface (V)

  26. Monitor Sensors • 9 light-weight monitor sensors have been developed for the nodes: • Temperatures and fan speeds; • CPU info (CPU number, brand and model, cache size, clock); • CPU states (user, system, nice, idle, iowait, irq, softirq); • Hardware interrupt rates (separately per CPU and per irq source); • Memory usage; • Process status (including scheduling class and real time priority); • Network Interface Card counters’ rates and error fractions; • Network Interface Interrupt Coalescence; • TCP/IP stack rates and error fractions.

  27. 1 - Temperature and Fan Speed Sensor • It collects the temperatures from the sensors integrated on the motherboard and the fan speeds. • Uses the lm_sensors software. • Needs bus drivers (for ISA or I2C/SMBus) and sensor chip drivers. • Integrated in the Linux kernel tree since kernel 2.6. • This server has been developed in 2 versions: one for Linux kernel 2.4, which gets the data from procfs, and the other for Linux kernel 2.6, which gets the data from sysfs. • E.g., on sysfs: • /sys/devices/platform/i2c-0/0-0290/temp1_input • /sys/devices/platform/i2c-0/0-0290/fan1_input

  28. 1 - Temperature and Fan Speed Sensor (II) • Hardware compatibility issues: • Most bus drivers have been ported to kernel 2.6; • However, only 43% of the chip drivers have been ported. • The IPMI alternative: • IPMI (Intelligent Platform Management Interface) v1.5 can get the same information in a more portable way (it is OS-independent). • The configuration of the IPMI environment monitoring system is the responsibility of the hardware vendor (more uniformity and reliability of data is expected). • In new software releases, temperatures and fan speeds will probably be collected using the IPMI LAN interface.

  29. 1 – Temperature and Fan Speed Sensor: PVSS Interface • [Screenshot; callout: node hostname]

  30. 1 – Temperature and Fan Speed Sensor: PVSS Interface (II)

  31. 2 – CPU Information Sensor • This server provides static information about the CPU(s), e.g.: • The CPU brand and model identifier string; • The CPU family, model and sub-version (revision) identifier; • The clock frequency of the CPU and the CPU cache size; • The number of hyper-threading cores in that physical CPU; • The CPU computational power in bogomips.

  32. 2 – CPU Information Sensor: PVSS Interface • [Screenshot; callout: right-click]

  33. 3 - CPU States Sensor • It collects, and evaluates as a percentage, both the aggregate values and the specific per-CPU values of the fraction of time spent by the CPUs performing different kinds of work: • user: normal processes executing in user mode. • nice: niced processes executing in user mode. • system: processes executing in kernel mode. • idle: not working. • iowait: waiting for I/O to complete. • irq: servicing interrupts (only in kernel ≥ 2.6). • softirq: servicing softirqs (only in kernel ≥ 2.6). • Moreover, it collects the global context switch rate (useful to check the operation of process scheduling).
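The percentage evaluation can be sketched as the difference between two samples of `/proc/stat`, where the kernel publishes cumulative per-state tick counters on the `cpu` (aggregate) and `cpuN` (per-CPU) lines. The function names below are hypothetical and the sketch is not the sensor's actual code; it works on text so that it can be tried without a live `/proc`.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_stat(text):
    """Return ({cpu_label: {state: ticks}}, context_switches) from /proc/stat text."""
    cpus, ctxt = {}, 0
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("cpu"):
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Kernel 2.4 reports only the first four states: pad with zeros.
            values += [0] * (len(FIELDS) - len(values))
            cpus[parts[0]] = dict(zip(FIELDS, values))
        elif parts[0] == "ctxt":
            ctxt = int(parts[1])
    return cpus, ctxt


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two parsed samples."""
    result = {}
    for cpu, new in after.items():
        old = before[cpu]
        deltas = {f: new[f] - old[f] for f in FIELDS}
        total = sum(deltas.values())
        result[cpu] = {f: (100.0 * d / total if total else 0.0)
                       for f, d in deltas.items()}
    return result
```

The context switch rate follows in the same way: the difference between the two `ctxt` values divided by the sampling interval in seconds.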

  34. 3 - CPU States Sensor: PVSS Interface

  35. 3 - CPU States Sensor: PVSS Interface

  36. 4 - Hardware Interrupt Sensor • It collects the rates of the interrupts raised by hardware devices. • Data are partitioned per CPU and per driver (timer/PIC/local APIC, rtc, eth0, eth1, etc.). • Average values and maximum values (since DIM server startup) are evaluated. • Useful to check the IRQ-to-CPU affinity of the network interfaces.
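A per-CPU, per-source rate of this kind can be derived from two samples of `/proc/interrupts`, whose first line names the CPUs and whose following lines carry one cumulative counter per CPU. The sketch below is illustrative only: the function names are hypothetical, and the handling of the trailing device name is a simplification that takes the last word of each line.

```python
def parse_interrupts(text):
    """Return {(irq, device): {cpu: count}} from the text of /proc/interrupts."""
    lines = text.splitlines()
    cpus = lines[0].split()          # header line: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue
        counts = []
        for token in parts[1:1 + len(cpus)]:
            if not token.isdigit():
                break
            counts.append(int(token))
        if len(counts) != len(cpus):
            continue                 # summary rows such as ERR: have one counter
        device = parts[-1] if len(parts) > 1 + len(cpus) else ""
        table[(parts[0].rstrip(":"), device)] = dict(zip(cpus, counts))
    return table


def interrupt_rates(before, after, interval_seconds):
    """Interrupts per second for every source and CPU present in both samples."""
    return {
        key: {cpu: (after[key][cpu] - before[key][cpu]) / interval_seconds
              for cpu in after[key]}
        for key in after if key in before
    }
```

A network interface whose interrupts are pinned to one CPU shows a non-zero rate in a single column, which is how the affinity can be checked.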

  37. 4 - Hardware Interrupt Sensor: PVSS Interface • [Screenshot; callout: kernel 2.6]

  38. 5 - Memory Usage Sensors • It collects memory usage statistics. • The available quantities depend on the kernel version. • Main collected quantities (more details): • Total/Low/High memory occupation. • Disk cache. • Virtual memory management. • Swapping and paging. • Vmalloc.

  39. 5 - Memory Usage Sensors: PVSS Interface • [Screenshot; callouts: more recently used; still not copied to disk; over-committing]

  40. 5 - Memory Usage Sensors: PVSS Interface (II) • [Screenshot; callout: right-click]

  41. 6 - Process Status Sensor • It collects the process status for each task, like “top” or “ps”. • It must cope with deep changes in the Linux threading model. • LinuxThreads, kernel ≤ 2.4.19: • Each thread has a unique process ID (PID). • The getpid() function therefore returns different values (PIDs) for the different threads of the same process. • NPTL, Native POSIX Threading Library, kernel ≥ 2.4.20: • Each thread has a unique identifier called TID (Thread Identifier), while the PID has been replaced by the TGID (Thread Group Identifier). • The getpid() function therefore returns the same value (TGID) for all threads in a process.

  42. 6 - Process Status Sensor (II) • Following the changes in the threading model, the process data layout in procfs depends on the kernel version: • /proc/<pid>/… up to kernel 2.4.19. • /proc/<tgid>/task/<tid>/… starting with kernel 2.4.20. • To cope with changes in kernel version, the process status sensor accesses kernel data in procfs by means of the maintained (but undocumented) library libproc-3.2.3.so: • from http://procps.sourceforge.net/ • (more details)
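The two procfs layouts can be made concrete with a small sketch that maps each thread group to its threads by walking the directory tree. The sensor itself goes through libproc rather than reading the directories directly, so this is only an illustration of the layout; the function name and the `proc_root` parameter (useful for trying the code on a mock tree) are hypothetical.

```python
import os


def list_threads(proc_root="/proc"):
    """Map each TGID to the sorted list of its TIDs.

    With the NPTL layout the TIDs are the entries of <proc_root>/<tgid>/task.
    Where that directory is missing (older layout, or the process has just
    exited), the entry is reported as a single-threaded task.
    """
    threads = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = sorted(int(t) for t in os.listdir(task_dir) if t.isdigit())
        except (FileNotFoundError, NotADirectoryError, PermissionError):
            tids = [int(entry)]
        threads[int(entry)] = tids
    return threads
```

A three-thread process then appears as one TGID with three TIDs, which matches the "same TGID, different TIDs" grouping the process panel displays.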

  43. 6 - Process Status Sensor: PVSS Interface (basic panel) • [Screenshot; callouts: TS: time sharing (Linux default); no UTGID (not started by the Task Manager); 3-thread process: same TGID, same UTGID, different TIDs]

  44. 6 - Process Status Sensor: PVSS Interface (advanced panel I)

  45. 6 - Process Status Sensor: PVSS Interface (advanced panel II) • SIZE: size of the core image of the task (code+data+stack). • VSIZE: virtual memory usage (lib+exe+data+stack). • RSS: non-swapped physical memory.

  46. 6 - Process Status Sensor: PVSS Interface (advanced panel III) • [Screenshot; callout: signals]

  47. 6 - Process Status Sensor: PVSS Interface (advanced panel IV) • [Screenshot; callouts: processor; S: interruptible sleep; s: session leader; l: multi-threaded]

  48. 7 - Network Interface Sensor • It collects the Network Interface Card counters and evaluates transmission rates and error fractions. • It also reads the interface name (eth0, eth1, etc.), the IP address and the MAC address of the network boards. • It copes with the wrap-around of the 32-bit rx_bytes/tx_bytes counters in the kernel when a Gigabit Ethernet interface is transmitting/receiving at full speed (but the sensor must be called at least once every 34 s). • Average values and maximum values (since DIM server startup) are evaluated and can be reset. • (more details)
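The 34 s figure follows from the counter width: a 32-bit byte counter holds 2^32 bytes, i.e. 2^32 × 8 bits, which a 1 Gbit/s link fills in about 34.4 s. As long as no more than one wrap-around occurs between two readings, the true increment can be recovered with modular arithmetic. The sketch below illustrates that reasoning; the function names are hypothetical and it is not claimed to be the method used inside the sensor.

```python
COUNTER_MODULUS = 2 ** 32  # a 32-bit counter wraps back to 0 after this many bytes


def wrap_period_seconds(link_bits_per_second):
    """Time a byte counter needs to wrap once at full link speed."""
    return COUNTER_MODULUS * 8 / link_bits_per_second


def counter_delta(previous, current):
    """Bytes transferred between two readings, allowing for one wrap-around."""
    return (current - previous) % COUNTER_MODULUS


def rate_bits_per_second(previous, current, interval_seconds):
    """Average transfer rate between two readings taken interval_seconds apart."""
    return 8 * counter_delta(previous, current) / interval_seconds
```

If the sensor were polled less often than the wrap period, two or more wrap-arounds could go unnoticed and the computed rate would be too low, hence the requirement on the polling interval.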

  49. 7 - Network Interface Sensor: PVSS Interface • [Screenshot; callouts: bit/s; frame/s; bytes/frame; if ≠ 0, cable problem or loose connector]

  50. 7 - Network Interface Sensor: PVSS Interface (II)
