MonALISA capabilities for the LHCOPN. MonALISA Team: Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan. USLHCNet Team: Harvey Newman, Artur Barczyk, Ramiro Voicu, Azher Mughal, Sandor Rozsa. LHCOPN meeting, March 2010, London.


  1. MonALISA capabilities for the LHCOPN • MonALISA Team: Iosif Legrand, Harvey Newman, Ramiro Voicu, Costin Grigoras, Ciprian Dobre, Alexandru Costan • USLHCNet Team: Harvey Newman, Artur Barczyk, Ramiro Voicu, Azher Mughal, Sandor Rozsa • LHCOPN meeting, March 2010, London

  2. Outline • MonALISA Framework • Architecture • Data handling • Automatic actions • USLHCNet • Network topology • Monitoring modules • Reliable monitoring & accounting • Alarms & triggers • Conclusions Ramiro Voicu LHCOPN London March 2010

  3. The MonALISA Architecture (layered diagram) • HL services: regional or global high level services, repositories & clients • Proxies: secure and reliable communication, dynamic load balancing, scalability & replication, AAA for clients • MonALISA services (agents): distributed system for gathering and analyzing information based on mobile agents; customized aggregation, triggers, actions • Network of JINI lookup services (secure & public): distributed dynamic registration and discovery, based on a lease mechanism and remote events • Fully distributed system with no single point of failure

  4. MonALISA Service & Data Handling (diagram labels) • Monitoring modules: push and pull, dynamic (re)loading, collect any type of information • Agents, filters / triggers • Data cache service & DB, Postgres data store • Lookup services: registration and discovery • Web service (WSDL, SOAP) for WS clients and services • Data (via ML Proxy), predicates & agents for clients or higher level services • Configuration control (SSL) • Applications

  5. Local and Global Decision Framework • Two levels of decisions: local (autonomous) and global (correlations) • Actions are triggered by: values above/below given thresholds, absence/presence of values, correlations between any values • Action types: alerts (emails, instant messages, Atom feeds), running an external command, automatic chart annotations in the repository, running custom code, such as securely ordering an ML service to (re)start a site service • Diagram, local decisions: an ML service takes actions based on local information from sensors (temperature, humidity, A/C power, …) • Diagram, global decisions: global ML services take actions based on global information (traffic, jobs, hosts, apps)
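The trigger conditions listed on this slide (thresholds, absence or presence of a value, correlations between values) can be illustrated with a small sketch. This is not the MonALISA actions API: the `Trigger` class, the helper names and the parameter names such as `temperature` are all hypothetical, and the "action" is reduced to a Python callable standing in for an alert or an external command.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Latest value per parameter name; a missing key means "no value received".
Values = Dict[str, float]
Condition = Callable[[Values], bool]


@dataclass
class Trigger:
    """Hypothetical trigger: a condition plus the action to run when it holds."""
    name: str
    condition: Condition
    action: Callable[[str], None]


def above(param: str, limit: float) -> Condition:
    """Fires when the value is present and above the threshold."""
    return lambda v: param in v and v[param] > limit


def below(param: str, limit: float) -> Condition:
    """Fires when the value is present and below the threshold."""
    return lambda v: param in v and v[param] < limit


def absent(param: str) -> Condition:
    """Fires when no value has been received for the parameter."""
    return lambda v: param not in v


def all_of(*conditions: Condition) -> Condition:
    """Simple correlation: every sub-condition must hold at once."""
    return lambda v: all(c(v) for c in conditions)


def evaluate(triggers: List[Trigger], values: Values) -> List[str]:
    """Run the action of every trigger whose condition holds; return their names."""
    fired = []
    for trigger in triggers:
        if trigger.condition(values):
            trigger.action(trigger.name)
            fired.append(trigger.name)
    return fired
```

In this sketch a local decision would evaluate triggers against values from one service, while a global decision would evaluate the same kind of trigger against values merged from several services.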

  6. USLHCNet • USLHCNet provides transatlantic connections of the Tier1 computing facilities at Fermilab and Brookhaven with the Tier0 and Tier1 facilities at CERN, as well as Tier1s elsewhere in Europe and Asia. • Together with ESnet, Internet2 and GEANT, USLHCNet supports connections between the Tier2 centers. • The USLHCNet core infrastructure uses Ciena Core Director devices, which provide time-division multiplexing and packet-forwarding protocols that support virtual circuits with bandwidth guarantees. The virtual circuits offer the functionality to develop efficient data transfer services with support for QoS and priorities. • Hybrid network: uses both Ciena Core Director (CD) devices and Force10 routers • 6 transatlantic 10G links at the moment

  7. USLHCNet ML weather map (figure)

  8. Monitoring modules • We developed a set of monitoring modules for USLHCNet network devices: • Force10 (SNMP & sFlow): traffic per interface, sFlow traffic, link status monitoring • Ciena Core Director (TL1, Transaction Language 1): ETTP (Ethernet Termination Point) traffic, EFLOW (Ethernet Flow) traffic, OSRP (routing protocol) topology, VCG provisioned / available bandwidth, dynamic circuits inside the optical core of the network • Ping module / MLPing trigger, which sends alarms in case of packet loss

  9. USLHCNet monitoring (diagram) • MonALISA services at GVA, AMS, NYC and CHI • Protocol labels: SNMP, TL1

  10. USLHCNet redundant monitoring (diagram: MonALISA services at GVA, AMS, NYC and CHI) • Each circuit is monitored at both ends by at least two MonALISA services; the monitored data is aggregated by global filters in the repository

  11. Local and global filters • Based on the MonALISA actions framework, a set of triggers has been deployed inside the service to notify the USLHCNet network engineers by email, SMS and IM in case of problems • The filters developed for the USLHCNet repository aggregate the redundant monitoring data (traffic and link status) collected from all the MonALISA services • The link status is computed as a logical “AND” between both end points of a link. This also cross-checks the status reported by the hardware equipment. • We collect data in two repository instances, each with replicated database back-ends. These instances are dynamically balanced in DNS.
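The link-status rule on this slide (a logical AND between both end points, computed over redundant reports) can be sketched as follows. The slide does not say how redundant reports for the same end point are combined, so this sketch assumes an end point counts as up if at least one monitoring service reports it up, and as down if no report arrived. The function name, the data layout and the link names are illustrative, not taken from the MonALISA repository filters.

```python
from typing import Dict, List

# reports[link][endpoint] = list of up/down reports, one per monitoring service
Reports = Dict[str, Dict[str, List[bool]]]


def link_status(reports: Reports) -> Dict[str, bool]:
    """Return the aggregated status of each link.

    A link is up only if BOTH of its end points are up (logical AND).
    Assumption: an end point is up if any redundant service reports it up;
    an end point with no reports, or a link with a missing end point, is down.
    """
    status = {}
    for link, endpoints in reports.items():
        both_present = len(endpoints) == 2
        status[link] = both_present and all(any(r) for r in endpoints.values())
    return status
```

Treating a missing end point as down is a conservative choice for this sketch: it makes a silent monitoring gap visible as a link problem rather than hiding it.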

  12. USLHCNet: Precise measurements for the Operational Status on the WAN Link • Operations & management assisted by agent-based software • Used on the new Ciena equipment for network management

  13. USLHCNet: All EFLOW traffic, last 2 months (figure)

  14. USLHCNet: Accounting for Integrated Traffic (figure)

  15. USLHCNet: Ciena alarms monitoring (figure)

  16. Topology monitoring and discovery • Real-time topology discovery & display (figure panels: networks, routers, AS)

  17. Storage discovery in ALICE • distance(IP, IP), from closest to farthest: same IP-class network; common domain name; same AS; same country (+ function of RTT between the respective ASes, if known); if the distance between the ASes is known, use it; same continent; far away • distance(IP, Set<IP>): client's public IP to all known IPs for the storage • Figure labels: France, Nordic Countries, Italy, Russia, USA • Credit: C. Grigoras (ALICE), ACAT 2010
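The ordering of criteria in distance(IP, IP) can be sketched as a ranked comparison. This is only an illustration under several assumptions that the slide does not state: "same IP-class network" is approximated as the same /24, a "common domain name" as the same last two DNS labels, the numeric ranks 0 to 5 are arbitrary, and distance(IP, Set<IP>) is taken to be the minimum over the storage's addresses. The RTT and inter-AS distance refinements are left out. The `Host` record and all helper names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Host:
    """Hypothetical record of what is known about one IP address."""
    ip: str
    domain: str
    asn: int
    country: str
    continent: str


def same_network(ip_a: str, ip_b: str, prefix: int = 24) -> bool:
    """Assumption: 'same IP-class network' means the same /24."""
    network = ipaddress.ip_network(f"{ip_a}/{prefix}", strict=False)
    return ipaddress.ip_address(ip_b) in network


def site_domain(name: str) -> str:
    """Assumption: a 'common domain name' means the same last two labels."""
    return ".".join(name.lower().split(".")[-2:])


def distance(a: Host, b: Host) -> int:
    """Rank two hosts from 0 (closest) to 5 (far away), first match wins."""
    if same_network(a.ip, b.ip):
        return 0
    if site_domain(a.domain) == site_domain(b.domain):
        return 1
    if a.asn == b.asn:
        return 2
    if a.country == b.country:
        return 3
    if a.continent == b.continent:
        return 4
    return 5


def distance_to_storage(client: Host, storage_hosts: Iterable[Host]) -> int:
    """Assumption: distance to a storage is the minimum over its known IPs."""
    return min(distance(client, host) for host in storage_hosts)
```

A client would then sort the candidate storage elements by `distance_to_storage` and try the closest one first.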

  18. FDT bandwidth tests in ALICE (E2E avbw) http://monalisa.cern.ch/FDT/ • Figure annotations: newer kernel, tuned TCP buffers; 1 Gbps network card; default kernels, default TCP buffers; different trends = different kernels; 100 Mbps network card

  19. Conclusions http://monalisa.caltech.edu http://repository.uslhcnet.org • The MonALISA framework provides a flexible and reliable monitoring infrastructure • 350+ installed services, 1.5M+ unique parameters, 25 kHz value updates • Truly distributed architecture with no single point of failure • Highly modular platform • Automatic decision-making capability at both local and global levels • USLHCNet provides a hybrid network with support for circuit-oriented network services • Monitoring this infrastructure proved to be a challenging task, but we are running with 99.5+% monitoring uptime (100% in the last 6 months) • We are investigating dynamic provisioning of circuits from collaborating agents

  20. Monitoring Optical Switches • Dynamic restoration of a lightpath if a segment has problems

  21. Controlling Optical Planes: Automatic Path Recovery • Diagram labels: CERN (Geneva), USLHCNet, Internet2, Starlight, Manlan, CALTECH (Pasadena) • FDT transfer at 200+ MBytes/sec from a 1U node • 4 “fiber cut” simulations • The traffic moves from one transatlantic line to the other one • The FDT transfer (CERN to CALTECH) continues uninterrupted • TCP fully recovers in ~20 s
