
ENMA: Co-operation in the corporation

ENMA: Co-operation in the corporation. Mort (Richard Mortier) MSR-Cambridge September 2004. Network management. …is the process of monitoring and controlling a large complex distributed system of dumb devices where failures are common and resources scarce


Presentation Transcript


  1. ENMA: Co-operation in the corporation • Mort (Richard Mortier) • MSR-Cambridge • September 2004

  2. Network management • …is the process of monitoring and controlling a large complex distributed system of dumb devices where failures are common and resources scarce • Enterprise networks are large but closely managed • Contrast with the Internet or university campus networks • No-one has the big picture! • Internet routeing uses distributed protocols • Current management tools all consider local info • Patchy SNMP support, configuration issues, sampling artefacts, tools generate CPU and network load

  3. This project • Building an edge-based network management platform • Collect flow information from hosts, and • Combine with topology information from routeing protocols • Enable visualization, analysis, simulation, control • Avoid problems of not-quite-standard interfaces • Management support is typically ‘non-critical’ (i.e. buggy) and not extensively tested for inter-operability • Do the work where resources are plentiful • Hosts have lots of cycles and little traffic (relatively) • Protocol visibility: see into tunnels, IPSec, etc

  4. Problem context: Enterprise networks • Large • 10^5 edge devices, 10^3 network devices • Geographically distributed • Multiple continents, 10^2 countries • Tightly controlled • IT department has (nearly) complete control over user desktops and network connected equipment

  5. Talk outline • System outline • What would it be good for? • In more detail… • Research issues

  6. System outline • [Diagram; labels only: Packets, Routeing protocol, Flows, Topology, Traffic matrix (srcs, dsts), Set of routes, Distributed database, Simulator, Control, Visualize, Simulate]

  7. Where is my traffic going today? • Pictures of current topology and traffic • Routes + flows + forwarding rules → BIG PICTURE • In fact, where did my traffic go yesterday? • Keep historical data for capacity planning, etc • A platform for anomaly detection • Historical data suggests “normality”, live monitoring allows anomalies to be detected
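
The anomaly-detection idea on this slide (history defines "normal", live samples are checked against it) could look roughly like the sketch below. The slides do not say which detector is intended; the mean and standard-deviation test, the function name `is_anomalous` and the threshold of 3 are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, live_value, threshold=3.0):
    """Return True if live_value deviates from the historical mean by more
    than `threshold` sample standard deviations (hypothetical detector)."""
    if len(history) < 2:
        return False  # too little history to define "normal"
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any different value counts as anomalous.
        return live_value != mu
    return abs(live_value - mu) / sigma > threshold

# Hypothetical per-interval byte counts for one link.
history = [100, 102, 98, 101, 99]
print(is_anomalous(history, 101))  # False
print(is_anomalous(history, 150))  # True
```

A real deployment would probably need per-time-of-day baselines, since the slides later mention time-of-day load changes.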

  8. Where might my traffic go tomorrow? • Plug into a simulator back-end • Discrete event simulator, flow allocation solver • Run multiple ‘what-if’ scenarios • …failures • …reconfigurations • …technology deployments • E.g. “What happens if we coalesce all the Exchange servers in one data-centre?”
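
A ‘what-if’ failure scenario of the kind listed on this slide can be sketched as: route every demand over the topology, record per-link load, remove the failed links, and route again. This is only an illustration; it routes on hop count with breadth-first search rather than real OSPF weights, and the names `shortest_path`, `link_loads` and `without_links` are hypothetical.

```python
from collections import defaultdict, deque

def shortest_path(adjacency, src, dst):
    """Hop-count shortest path via BFS; returns a node list, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in sorted(adjacency.get(node, ())):  # sorted: deterministic ties
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None

def link_loads(adjacency, demands):
    """Route each (src, dst) demand and sum the volume on every link used."""
    loads = defaultdict(int)
    unrouted = []
    for (src, dst), volume in demands.items():
        path = shortest_path(adjacency, src, dst)
        if path is None:
            unrouted.append((src, dst))
            continue
        for a, b in zip(path, path[1:]):
            loads[frozenset((a, b))] += volume  # undirected link
    return dict(loads), unrouted

def without_links(adjacency, failed):
    """Copy of the topology with the given undirected links removed."""
    failed = {frozenset(link) for link in failed}
    return {n: {m for m in nbrs if frozenset((n, m)) not in failed}
            for n, nbrs in adjacency.items()}

# Hypothetical four-node square topology and two demands.
topology = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
demands = {("A", "D"): 10, ("A", "B"): 5}
before, _ = link_loads(topology, demands)
after, _ = link_loads(without_links(topology, [("A", "B")]), demands)
print(before[frozenset(("A", "B"))])  # 15
print(after[frozenset(("C", "D"))])   # 15
```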

  9. Where should my traffic be going? • Close the loop: compute link weights to implement policy goals • Recompute on order of hours/days • Allows more dynamic policies • Modify network configuration to track e.g. time of day load changes • Might make network more efficient (~cheaper)

  10. Where are we now? • Three major components • Flow collection • Route collection • Distributed database • Still studying feasibility • Starting to build prototypes

  11. Data collection • Flow collection • Hosts track active flows • Using low overhead event posting infrastructure, ETW • Built prototype device driver provider & user-space consumer • Used packet traces for feasibility study on (client, server) • Peaks at (165, 5667) live and (39, 567) active flows per sec • Route collection • OSPF is link-state: passively collect link state adverts • Extension of my work at Sprint (for IS-IS and BGP); also been done at AT&T (NSDI’04 paper)

  12. The distributed database • Logically contains • Traffic flow matrix (bandwidths), {srcs}×{dsts} • …each entry annotated with current route from src to dst • N.B. src/dst might be e.g. (IP end-point, application) • Large dynamic data set suggests aggregation • Related work • { distributed, continuous query, temporal } databases • Sensor networks • Potential starting points: Astrolabe or SDIMS (SIGCOMM’04) • Where/what/how much to aggregate? • Is data read- or write-dominated? • Which is more dynamic, flow or topology data? • Can the system successfully self-tune?

  13. The distributed database • Construct traffic matrix from flow monitoring • Hosts can supply flows they source and sink • Only need a subset of this data to get complete traffic matrix • Construct topology from route collection • OSPF supplies topology → routes • Wish to be able to answer queries like • “Who are the top-10 traffic generators?” • Easy to aggregate, don’t care about topology • “What is the load on link l?” • Can aggregate from hosts, but need to know routes • “What happens if we remove links {l…m}?” • Interaction between traffic matrix, topology, even flow control
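
The first two example queries on this slide can be illustrated with a small in-memory sketch: host-reported flow records are summed into a traffic matrix, the top-N query aggregates by source without touching topology, and the link-load query also needs the route for each (src, dst) pair. The record layout `(src, dst, bytes)` and the function names are assumptions, and a real system would be distributed rather than a single dictionary.

```python
from collections import defaultdict

def build_traffic_matrix(flow_records):
    """Sum bytes per (src, dst) pair from (src, dst, nbytes) records."""
    matrix = defaultdict(int)
    for src, dst, nbytes in flow_records:
        matrix[(src, dst)] += nbytes
    return dict(matrix)

def top_generators(matrix, n=10):
    """'Who are the top-n traffic generators?' Needs no topology."""
    per_src = defaultdict(int)
    for (src, _dst), nbytes in matrix.items():
        per_src[src] += nbytes
    return sorted(per_src.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

def link_load(matrix, routes, link):
    """'What is the load on link l?' Needs the route for each pair.
    routes maps (src, dst) to the node list along the path; link is a
    directed (from_node, to_node) tuple."""
    total = 0
    for pair, nbytes in matrix.items():
        path = routes.get(pair, [])
        if link in set(zip(path, path[1:])):
            total += nbytes
    return total

# Hypothetical flow records and routes.
flows = [("h1", "h2", 100), ("h1", "h3", 50), ("h2", "h3", 70), ("h1", "h2", 30)]
routes = {
    ("h1", "h2"): ["h1", "r1", "h2"],
    ("h1", "h3"): ["h1", "r1", "r2", "h3"],
    ("h2", "h3"): ["h2", "r1", "r2", "h3"],
}
matrix = build_traffic_matrix(flows)
print(top_generators(matrix, n=1))            # [('h1', 180)]
print(link_load(matrix, routes, ("r1", "r2")))  # 120
```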

  14. The distributed database • Building simulation model • OSPF data gives topology, event list, routes • Simple load model to start with (load ~ # subnets) • Precedence matrix (from SPF) reduces flow-data query set • Can we do as well/better than e.g. NetFlow? • Accuracy/coverage trade-off • How should we distribute the DB? • Just OSPF data? Just flow data? A mixture? • How many levels of aggregation? • How many nodes do queries touch? • What sort of API is suitable? • Example queries for sample applications

  15. Research issues • Corner cases • Scalability • Robustness, accuracy • Control systems

  16. Research issues • Corner cases • Multi-homed hosts: how best to define a flow • L4 routeing, NAT, proxy ARP, transparent proxies • (Solve using device config files, perhaps SNMP) • Scalability • Host measurement must not be intrusive (in terms of packet latency, CPU load, network bandwidth) • Aggregators must elect themselves in such a way that they do not implode under event load • What happens if network radically alters? E.g. • Extensive use of multicast • Connection patterns shift due to e.g. P2P deployment

  17. Research issues • Robustness • Network management had better still work as nodes fail or the network partitions! • Accuracy in the face of late, partial information • By accident: unmonitored hosts • By design: aggregation, more detail about local area • Inference of link contribution to cumulative metrics, e.g. RTT • Network control: modify link weights • How efficient is the current configuration anyway? • What are plausible timescales to reconfigure?

  18. Summary • Aim to build a coherent edge-based network management platform using flow monitoring and standard routeing protocols • Applications include visualization, simulation, dynamic control • Research issues include • Scalability: want to manage a 300,000 node network • Robustness: must work as nodes fail or network partitions • Accuracy: will not be able to monitor 100% of traffic • Control systems: use the data to optimize the network in real-time, as well as just observe and simulate

  19. Current status • Submitted HotNets paper • Prototype ETW provider/consumer driver • Studied feasibility of flow monitoring • Prototype OSPF collector & topology reconstruction • Investigating “distributed database” via simulation • Query properties • System decomposition • Questions, comments?

  20. Backup slides • SNMP • Internet routeing • OSPF • BGP • Security

  21. SNMP • Protocol to manage information tables at devices • Provides get, set, trap, notify operations • get, set: read, write values • trap: signal a condition (e.g. threshold exceeded) • notify: reliable trap • Complexity mostly in the table design • Some standard tables, but many vendor specific • Non-critical, so often tables populated incorrectly

  22. Internet routeing • Q: how to get a packet from node to destination? • A1: advertise all reachable destinations and apply a consistent cost function (distance vector) • A2: learn network topology and compute consistent shortest paths (link state) • Each node (1) discovers and advertises adjacencies; (2) builds link state database; (3) computes shortest paths • A1, A2: Forward to next-hop using longest-prefix-match
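
Both answers end with the same forwarding step, longest-prefix-match. A minimal sketch using Python's standard `ipaddress` module follows; the table contents and next-hop names are made up, and a linear scan is used for clarity where a real router would use a trie or similar structure.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the next hop of the most specific prefix containing
    destination, or None if nothing matches.
    table maps prefix strings such as "10.0.0.0/8" to next-hop names."""
    addr = ipaddress.ip_address(destination)
    best = None  # (network, next_hop) of the best match so far
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]

# Hypothetical forwarding table.
table = {"0.0.0.0/0": "gw", "10.0.0.0/8": "r1", "10.1.0.0/16": "r2"}
print(longest_prefix_match(table, "10.1.2.3"))   # r2
print(longest_prefix_match(table, "10.2.0.1"))   # r1
print(longest_prefix_match(table, "192.0.2.1"))  # gw
```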

  23. OSPF (~link state routeing) • Q: how to route given packet from any node to destination? • A: learn network topology; compute shortest paths • For each node • Discover adjacencies (~immediate neighbours); advertise • Build link state database (~network topology) • Compute shortest paths to all destination prefixes • Forward to next-hop using longest-prefix-match (~most specific route)
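
The "compute shortest paths" step can be sketched with Dijkstra's algorithm over a link state database. This is a simplification: the database is assumed to be a plain dictionary of node to {neighbour: cost}, and OSPF areas, equal-cost multipath and destination prefixes are ignored.

```python
import heapq

def shortest_paths(lsdb, source):
    """Dijkstra over lsdb = {node: {neighbour: cost}}.
    Returns {destination: (total_cost, first_hop)}; first_hop is None
    for the source itself."""
    best = {}
    heap = [(0, source, None)]
    while heap:
        cost, node, first_hop = heapq.heappop(heap)
        if node in best:
            continue  # already settled with a cost at least as low
        best[node] = (cost, first_hop)
        for nbr, weight in lsdb.get(node, {}).items():
            if nbr not in best:
                # The first hop is fixed when leaving the source.
                hop = nbr if node == source else first_hop
                heapq.heappush(heap, (cost + weight, nbr, hop))
    return best

# Hypothetical four-router link state database.
lsdb = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
print(shortest_paths(lsdb, "A")["D"])  # (4, 'B')
```

The `first_hop` recorded for each destination is what a node would install as its next-hop for that destination.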

  24. BGP (~path vector routeing) • Q: how to route given packet from any node to destination? • A: neighbours tell you destinations they can reach; pick cheapest option • For each node • Receive (destination, cost, next-hop) for all destinations known to neighbour • Select among all possible next-hops for given destination • Advertise selected (destination, cost+, next-hop') for all known destinations • Selection process is complicated • Routes can be modified/hidden at all three stages • General mechanism for application of policy
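
The receive, select, advertise loop on this slide can be sketched as below. It is deliberately much simpler than real BGP: selection uses only a scalar cost (the slide itself notes the real process is complicated), there is no AS path or loop detection, and `select_routes`, `advertise`, `import_filter` and `link_cost` are hypothetical names.

```python
def select_routes(received, link_cost, import_filter=lambda nbr, dest, cost: True):
    """received: {neighbour: {destination: advertised_cost}}
    link_cost: {neighbour: cost of the link to that neighbour}
    import_filter: policy hook; return False to hide a route on receipt.
    Returns {destination: (total_cost, next_hop)} choosing the cheapest."""
    table = {}
    for nbr in sorted(received):  # sorted: deterministic tie-breaking
        for dest, cost in received[nbr].items():
            if not import_filter(nbr, dest, cost):
                continue
            total = cost + link_cost[nbr]  # the "cost+" on the slide
            if dest not in table or total < table[dest][0]:
                table[dest] = (total, nbr)
    return table

def advertise(table, self_name):
    """Re-advertise selected routes with ourselves as the next hop."""
    return [(dest, cost, self_name) for dest, (cost, _nh) in sorted(table.items())]

# Hypothetical advertisements from two neighbours.
received = {
    "n1": {"10.0.0.0/8": 3, "192.168.0.0/16": 1},
    "n2": {"10.0.0.0/8": 1},
}
link_cost = {"n1": 1, "n2": 2}
table = select_routes(received, link_cost)
print(table["10.0.0.0/8"])  # (3, 'n2')
print(advertise(table, "me"))
```

Passing an `import_filter` that rejects routes from `n2` would make `n1` the selected next hop, which is the "routes can be modified/hidden" policy hook in miniature.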

  25. Security • Threat: malicious/compromised host • Authenticate participants • Must secure route collector as if a router • Threat: DoS on monitors • Difference between client under DoS and server? • Rate pace output from monitors • Threat: eavesdropping • Standard IPSec/encryption solutions
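
One possible way to rate-pace monitor output is a token bucket, sketched below. The slide does not name a mechanism, so the token bucket, the class name and the parameters `rate` and `burst` are all assumptions. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Allow at most `rate` reports per second on average, with bursts
    of up to `burst` reports (hypothetical pacing policy)."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens held
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # time of the previous call

    def allow(self, now, cost=1.0):
        """Return True if a report costing `cost` tokens may be sent at `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=1.0, burst=2.0)
print([bucket.allow(t) for t in (0.0, 0.0, 0.0, 1.0, 1.5, 2.0)])
# [True, True, False, True, False, True]
```

A monitor under a flood of events would then drop or aggregate the reports for which `allow` returns False instead of amplifying the load.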
