
Enterprise Command Center

Enterprise Command Center. Presenter: Chris Rogers. ECC—Early days—Why?. Enterprise Command Center (ECC) Missions. Increase availability of IT services Improve communication Reduce mean time to repair Reduce the number of problem calls. Clients. University of New Mexico


Presentation Transcript


  1. Enterprise Command Center Presenter: Chris Rogers

  2. ECC—Early days—Why?

  3. Enterprise Command Center (ECC) Missions • Increase availability of IT services • Improve communication • Reduce mean time to repair • Reduce the number of problem calls

  4. Clients • University of New Mexico • Information Technology Services • Communication and Network Services • The Albuquerque GigaPop • Pilot with the New Mexico State Government

  5. ECC Strategy • Phase 1 - Get visibility into the environment • Phase 2 - Establish network troubleshooting and fault • Phase 3 - Establish application troubleshooting and fault • Phase 4 - Ticket the responsible party and alert all upstream departments, including functional departments • Phase 5 - Work on prediction technology • Phase 6 - Institute SLA matrix

  6. Services • Fully functional Network Operations Center • Proactively provide end-to-end management of IT services. • Cross silo troubleshooting services

  7. Network Operations Center (NOC): What we do (Phase 2) • Monitor 800 network devices supporting 35,000 end user nodes • Monitor 11 WAN sites and connections to Internet2, LambdaRail, ESnet, and the commodity Internet

  8. NOC Functions: What we do it with • Traditional Network Fault Management • Spectrum from Computer Associates (Concord, Aprisma) • Trending and Analysis • eHealth from Computer Associates (Concord) • Arbor SP and Arbor X • End User Simulation • Automate • Application Response (AR) from Computer Associates (Concord)

  9. NOC Functions: What we do it with • Public Interfaces • PHP Weather Map • Joomla • Supplementary Programs • Groundworks (Nagios) • Sniffers from Network General and Ethereal

  10. Where to Start • Auto Discovery • Modeling • Connectivity • Integration • Fault Identification – Root Cause • Event Suppression

  11. Spectrum Auto Discovery Starts with a seed router. Discovers all of the next-hop routers within a defined network range using IP routing tables.

  12. Spectrum Auto Discovery Along the way each Router is interrogated

  13. Spectrum Auto Discovery Along the way each device is interrogated and a relational model is created with the router's physical characteristics, abilities, interfaces, and even sub-interfaces. [Figure: Spectrum model of a router. Physical characteristics: cards (T1 card), interfaces (T1 6/0, T1 7/0, T1 8/0), and sub-interfaces (SE6/0.1, SE6/0.2, SE8/0.1). Functional characteristics: routing, switching, and VLANs.]

  14. Spectrum Auto Discovery Next Spectrum determines adjacency

  15. Spectrum Auto Discovery Next Spectrum determines adjacency and an adjacency model is created at the port – sub/port level

  16. Spectrum Auto Discovery This includes identifying VLAN connections and redundant paths. [Figure labels: VLAN101, VLAN102]

  17. Spectrum Auto Discovery This includes identifying VLAN connections and redundant paths. [Figure labels: VLAN101, VLAN102]

  18. Spectrum Auto Discovery This way correct connectivity can be established. This continues to the switching and bridging layers.
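The discovery walk described in slides 11 to 18 can be sketched as a breadth-first traversal. This is only an illustration of the idea, not Spectrum's actual implementation: the `ROUTING_TABLES` dictionary, the addresses, and the `discover` function are all hypothetical stand-ins for data a real tool would read from each router.

```python
from collections import deque
from ipaddress import ip_address, ip_network

# Hypothetical routing tables: router IP -> next-hop router IPs.
# A real discovery would read these from each router instead.
ROUTING_TABLES = {
    "10.0.0.1": ["10.0.0.2", "10.0.0.3"],
    "10.0.0.2": ["10.0.0.1", "10.0.0.4"],
    "10.0.0.3": ["10.0.0.1", "192.168.5.1"],  # last hop is outside the range
    "10.0.0.4": ["10.0.0.2"],
}


def discover(seed, network_range, tables):
    """Breadth-first walk of next hops, staying inside network_range."""
    allowed = ip_network(network_range)
    found = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        router = queue.popleft()
        for hop in tables.get(router, []):
            # Skip routers already visited or outside the defined range.
            if hop in seen or ip_address(hop) not in allowed:
                continue
            seen.add(hop)
            found.append(hop)
            queue.append(hop)
    return found


routers = discover("10.0.0.1", "10.0.0.0/24", ROUTING_TABLES)
print(routers)
```

Starting from the seed `10.0.0.1`, the walk collects every in-range router and ignores `192.168.5.1`, which lies outside the defined network range.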

  19. Network Fault Management Spectrum polls network devices over SNMP and ICMP like every other network poller. But once a fault is detected, Spectrum begins interrogating the device and the devices adjacent to it to determine whether it is a false alarm or to find the root cause of the outage.

  20. Network Fault Management [Figure: Spectrum probing Switch 1 through Switch 5 with ICMP and SNMP] Spectrum sends ICMP packets to the downstream devices, which fail. Spectrum sends ICMP packets to the upstream devices, which succeed. Spectrum then interrogates the upstream device to determine why it cannot reach the down switch. In this case it identifies that the port on the upstream switch has been placed administratively down. Spectrum creates a customizable alarm and places the downstream devices in an unreachable state to suppress any errors from them.

  21. Event Suppression • Both tools see the same ping failures: Switch 3, Switch 1, Switch 4, Switch 2, Switch 5 • Spectrum alarms • Switch 1 Down • Traditional NMS alarms • Switch 3 Down • Switch 1 Down • Switch 4 Down • Switch 2 Down • Switch 5 Down
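A minimal sketch of the root cause and suppression idea from slides 20 and 21, assuming a simple tree topology. The `UPSTREAM` map is hypothetical (the slides do not give the exact wiring), and `correlate` is an illustrative name, not a Spectrum function. The rule used here: a failed device whose upstream neighbor still answers is treated as the root cause, and a failed device whose upstream neighbor also failed is marked unreachable and suppressed.

```python
# Hypothetical topology: device -> its upstream neighbor (None for the root).
UPSTREAM = {
    "Core": None,
    "Switch 1": "Core",
    "Switch 2": "Switch 1",
    "Switch 3": "Switch 1",
    "Switch 4": "Switch 2",
    "Switch 5": "Switch 2",
}


def correlate(failed, upstream):
    """Split ping failures into root-cause alarms and suppressed devices."""
    failed = set(failed)
    alarms, suppressed = [], []
    for device in sorted(failed):
        if upstream.get(device) in failed:
            # The upstream neighbor is down too, so this device is only unreachable.
            suppressed.append(device)
        else:
            # The upstream neighbor answers, so the fault starts here.
            alarms.append(device)
    return alarms, suppressed


ping_failures = ["Switch 3", "Switch 1", "Switch 4", "Switch 2", "Switch 5"]
alarms, suppressed = correlate(ping_failures, UPSTREAM)
print("Alarms:", alarms)
print("Suppressed:", suppressed)
```

With all five switches failing to answer, only `Switch 1` raises an alarm under this assumed topology, which matches the single alarm shown for Spectrum on the slide.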

  22. Network Fault Management In addition to root cause and event suppression, Spectrum handles • Maintenance periods • Customizable alarm management and filtering • Customizable per alarm and device scripting

  23. Service Based Monitoring • What is important to the user • Test from a user's perspective • If a service goes down and nobody notices, did it really go down? • If a service looks like it is down, although everything is working fine, IT'S DOWN!!! • End user testing is problematic • Use a 2 out of 3 approach
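One way to read the "2 out of 3 approach" is as a simple vote: a service is only declared down when at least two of three independent tests fail, so a single flaky test does not raise an alarm. The slide does not spell out the rule, so this interpretation, the function name, and the `quorum` parameter are assumptions.

```python
def service_is_down(probe_results, quorum=2):
    """Return True when at least `quorum` probes report a failure.

    probe_results is a list of booleans, True meaning the probe succeeded.
    """
    failures = sum(1 for ok in probe_results if not ok)
    return failures >= quorum


print(service_is_down([True, True, False]))   # one failing probe
print(service_is_down([True, False, False]))  # two failing probes
```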

  24. Automated end user simulation The truest measure of an application's workability is whether the users who use that application believe it is working normally. Therefore the end user's perception of speed and availability are the prime metrics for measuring end user happiness quantitatively. These prime metrics are the holy grail for determining whether there really has been a performance or reliability impact or whether the user is just complaining.

  25. Automated end user simulation So the first question is how does the end user determine speed? In our case most of our applications are web based. So speed is determined by the responsiveness of the website to perform tasks. Therefore we decided to test the time it takes to perform a simple end user function using automation that simulates an end user.

  26. Automated end user simulation We replace the end user with automation software which reacts to the application just as an end user would.

  27. Automated end user simulation [Figure: Client 320 ms, Network 44 ms, Server 300 ms] We then benchmark the response time of the test every 5 minutes. By examining the network packets, eHealth's AR agent determines where the time in the response time is consumed (client, network, or server). In troubleshooting, this is our first line of defense in identifying the cause of a slowdown in performance: is it the client, the network, or the servers?

  28. Automated end user simulation [Figure: Robot2, Client 320 ms, Network 44 ms, Server 30 ms] We also perform this test from at least 3 different machines (robots). This absorbs any variance from normal end user random behaviors. Notice the network time for Robot2.
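Comparing the client, network, and server breakdown across robots can be sketched as below. This is not how eHealth's AR agent works internally; it only illustrates the comparison. The measurements are illustrative values in milliseconds, and the `factor` threshold and the function name are assumptions.

```python
from statistics import median


def slow_components(samples, factor=2.0):
    """Flag (robot, component) pairs slower than factor x the median across robots."""
    flagged = []
    for component in ("client", "network", "server"):
        typical = median(times[component] for times in samples.values())
        for robot, times in sorted(samples.items()):
            if times[component] > factor * typical:
                flagged.append((robot, component))
    return flagged


# Illustrative response time breakdowns per robot, in milliseconds.
SAMPLES = {
    "Robot1": {"client": 320, "network": 44, "server": 300},
    "Robot2": {"client": 320, "network": 900, "server": 300},
    "Robot3": {"client": 310, "network": 40, "server": 290},
}

flagged = slow_components(SAMPLES)
print(flagged)
```

Using the median across at least three robots keeps one unusual robot from shifting the baseline, which is the same reason for testing from several machines.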

  29. Automated end user simulation [Figure: network path between Robot 2 and the application servers] Look for a cause of the delay between Robot 2 and the application server.

  30. Spectrum eHealth Integration Spectrum integrates with eHealth to populate devices for discovery. After a device is discovered in eHealth, trending information is available by right-clicking the device.

  31. Spectrum eHealth Integration This includes trending information on the interface, sub-interface, and VLAN interfaces. Additionally, drilling into this graph gives further detail on the utilization of this link. [Figure labels: VLAN101, VLAN102]

  32. eHealth Trending This T1 link appears to be at 100%. It could be the cause of the delay in response time. OK, that is nice to know. The next question would be “Well, what do we do about it?”
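For context on what "100%" means here, link utilization is commonly derived from two readings of an interface's SNMP octet counter. The sketch below shows that arithmetic for a T1 (1.544 Mbit/s). It is not taken from eHealth; the function name and the assumption of a 32-bit counter that wraps at most once between polls are illustrative.

```python
T1_BPS = 1_544_000       # T1 line rate in bits per second
COUNTER32_MAX = 2 ** 32  # assumed 32-bit octet counter


def utilization(octets_start, octets_end, seconds, link_bps=T1_BPS):
    """Percent utilization between two octet-counter readings."""
    # The modulo handles a single counter wrap between the two polls.
    delta_octets = (octets_end - octets_start) % COUNTER32_MAX
    return 100.0 * delta_octets * 8 / (seconds * link_bps)


# 57,900,000 octets in 300 seconds is exactly a full T1.
print(utilization(0, 57_900_000, 300))
```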

  33. eHealth and Arbor Do we increase the bandwidth? To get the data to answer this question we need to find out what this utilization really is. Introducing Arbor. Arbor is a NetFlow-based packet analyzer that reports on who is doing what, where.

  34. Automated end user simulation NetFlow is collected from centralized routers. Only traffic traversing these points is seen.

  35. Arbor SP Who internally is using this traffic?

  36. Arbor SP What is this traffic?

  37. Arbor SP Which direction did it go from my router?

  38. Arbor SP Which peer network did it use?
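The questions on the Arbor SP slides (who, what, which direction, which peer) all amount to grouping flow records by one field and summing the bytes. A rough sketch under that assumption follows; the flow records, field names, and `top_by` helper are hypothetical and far simpler than real NetFlow exports or Arbor's reports.

```python
from collections import Counter

# Hypothetical, simplified flow records; real NetFlow exports carry more fields.
FLOWS = [
    {"src": "10.1.1.5", "dst_port": 80, "bytes": 1200},
    {"src": "10.1.1.9", "dst_port": 1434, "bytes": 90000},
    {"src": "10.1.1.9", "dst_port": 1434, "bytes": 85000},
    {"src": "10.1.2.7", "dst_port": 443, "bytes": 4000},
]


def top_by(flows, key, n=3):
    """Sum bytes per value of `key` and return the n largest totals."""
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common(n)


top_sources = top_by(FLOWS, "src")     # who is using the link
top_ports = top_by(FLOWS, "dst_port")  # what the traffic is
print(top_sources)
print(top_ports)
```

Grouping the same records by a different field answers a different question, which is why one flow collection can feed several views.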

  39. Arbor X • This graph shows the actual successful traffic traversing the T1

  40. Automated end user simulation In our case the traffic is a DoS attack. We blocked this port at our boundary router. The response times for Robot 2 returned to normal.

  41. Phase 3: Migrating to a Service Based Monitoring Center • Separating the network from the application. • Moving end user simulation out of the network and closer to the application • Test the network using Cisco’s IP SLA (SAA) and test from every end device

  42. Automated end user simulation With the addition of Spectrum we will stop using the robots to identify network problems. We are centralizing the robots and using them to report on application response time only.

  43. Service Based Monitoring We are using Cisco IP SLA (SAA) to test network delay. This will allow tests to be performed from every end of the network, not just the ones with robots. [Figure: SAA tests running from multiple points in the network]

  44. Show Public View [Figure: App 1 application response over the last two hours; network response per building]

  45. Troubleshooting Network Response Time [Figure: Weathermap with per-link response time labels ranging from 2ns to 54ns]

  46. Cross Silo Troubleshooting • At UNM our technical teams are organized into functional silos. These silos become very specialized and detailed within their fields. • This specialization creates the problem of teams not becoming cross-trained in other areas of expertise. • The ECC is purposely staffed by veterans who, though knowledgeable in their respective areas, are also “jack of all trades” type people.

  47. Cross Silo Troubleshooting [Figure: the response time metric bridges the gap between silos: Client Application, Server Application, Networking, Windows, Linux, Storage, Database, Virtualization]

  48. Cross Silo Troubleshooting Response Time Metric Data

  49. Future Services • Provide metrics and alerting of Service Level Management to our Information Assurance team. • Add VOIP as a new monitored application. • Additional Services

  50. Reality Sets in • 2nd Level Support • 24x7x365 • No Resources—Banner Applications • Self Sustainability—Deeper level • Automation Balance
