
ABC Co. Network Implementation


oleg

Presentation Transcript


  1. ABC Co. Network Implementation • High reliability is the primary concern • Near 100% uptime required • Customer SLA has stiff penalty clauses • Everything is designed in a redundant fashion • Network redundancy is not integrated with system design or application design • Application and system design are not integrated • Management was added last (to fix problems)

  2. The challenge is always politics • Politics prevents different parts of the company from working together. • Networking, Systems, and Applications are three different groups. • The Systems group owns the management issues. • Some requirements get in the way: • e.g. the management station must keep its data on the database server.

  3. Network design • “Dual Everything” is the design rule • Dual routers/hubs (Cisco 5500s) • Dual Ethernet • Dual-attached systems

  4. A simple picture • [Diagram: a redundant net to customers feeds two Rtr/Hub units; a dual-rail Ethernet connects both to Server a through Server n, TNG, DNS, and WINS]

  5. More detail • No actual “Ethernet bus” • Systems connect to the 5500s via UTP • Each system connects to both 5500s • One connection is to the “primary” LAN, the other to the secondary LAN • Half of the systems have the “left” 5500 as primary, the other half have the “right” one as primary. • The 5500s run OSPF and “router cluster” software

  6. Problems... • The server operating systems (NT and Unix) do not stop using the primary interface when it fails; they keep trying to use it. Applications hang and connections time out. • DNS points to only one interface on each server. • No automatic failover is built into the applications.

  7. Management software must: • Detect NIC failures • Continue to monitor system agents in presence of network failures • Correct server routing tables if primary interface fails (or the hub fails) • Update DNS • Notify operations as required.

  8. Challenges • Get each system to report all status via both NICs. • Monitor system over both NICs. • Prevent duplicate notifications. • Fail over as fast as possible. • Show connectivity of each system to both networks.

  9. What needs to be done to do this? • Modify the auto discovery scripts to add each system twice, as two independent systems. • This requires a private host file for name/address translation (cannot depend on access to DNS) • Invent a system to recognize which interface is “active” and block messages from the other NIC(s)
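The double registration described on this slide can be sketched as follows. This is a minimal illustration, not the actual discovery script: the function names, the host names, and the IP addresses are hypothetical, and the `-p`/`-s` suffixes follow the naming trick described on slide 11.

```python
def discovery_entries(hosts):
    """Register every server twice, once per NIC.

    hosts maps a real server name to (primary_ip, secondary_ip).
    Returns a list of (managed_name, ip) pairs.
    """
    entries = []
    for name, (primary_ip, secondary_ip) in sorted(hosts.items()):
        entries.append((f"{name}-p", primary_ip))    # primary interface
        entries.append((f"{name}-s", secondary_ip))  # secondary interface
    return entries


def hosts_file_lines(entries):
    """Render the entries as lines for a private hosts file (no DNS needed)."""
    return [f"{ip}\t{managed_name}" for managed_name, ip in entries]


# Hypothetical example data
entries = discovery_entries({"server-a": ("10.0.1.10", "10.0.2.10")})
for line in hosts_file_lines(entries):
    print(line)
```

Keeping the translation in a local file means the management station can still resolve both names when the DNS server itself is unreachable.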

  10. More work... • Duplicate any information in Object Repository that is needed to manage failover onto local system (cannot trust access to SQL server) • Store current connectivity state for all servers (added ILPs to class definitions).

  11. Tricks used • Each system name in messages has a code added to the end to indicate the interface address (-p or -s) • Most of the work is done in event message processing. • Each “raw” message is suppressed and a script is invoked to process it. • Ping successes/failures are used to switch state • Agent messages are dropped based on state and the p/s flag
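As an illustration of the naming trick, the sketch below splits a reported name into the real server name and the interface flag. The function name and the example names are hypothetical; the original processing was done in event-handling scripts, not in Python.

```python
def split_interface_code(reported_name):
    """Split 'server-a-p' into ('server-a', 'P').

    Returns (name, None) when the name carries no -p or -s code.
    """
    for code, interface in (("-p", "P"), ("-s", "S")):
        if reported_name.endswith(code):
            return reported_name[: -len(code)], interface
    return reported_name, None


print(split_interface_code("server-a-p"))
print(split_interface_code("server-a-s"))
```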

  12. Basic set of flows • For each event (other than pings): • If the mode (kept in the NT Registry) is P and the message is from S, or the mode is S and the message is from P, discard it. • Else, reformat the message with the real server name, improve the content (system class, etc.), and send it back to the event console as a new message
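The filter on this slide amounts to a single comparison. The sketch below assumes the current mode is passed in as a plain argument (the original kept it in the NT Registry) and uses a made-up output format; `process_event` and its parameters are hypothetical names.

```python
def process_event(mode, source_interface, server_name, text, system_class="unknown"):
    """Return None to discard the event, else the reformatted message.

    mode and source_interface are each 'P' or 'S'.
    """
    if source_interface != mode:
        # Message arrived over the inactive interface: drop the duplicate.
        return None
    # Re-issue under the real server name with extra context.
    return f"{server_name} [{system_class}]: {text}"


print(process_event("P", "S", "server-a", "disk full"))
print(process_event("P", "P", "server-a", "disk full", "NT server"))
```

Because every agent reports over both NICs, dropping whatever arrives on the inactive side is what keeps operations from seeing each event twice.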

  13. More Flow • For each Ping Success/Fail reported: • Remember that the DSM has already done the retries • On a failure, check whether the other port fails, too. If the other port is also dead, declare the node down and reset the state to primary. • If the failed port is the primary, then fail over to the secondary. If it is the secondary, do a “failure” back to primary. • Update DNS in all cases.
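One way to read the ping-failure rules is as a small decision function. This is a sketch under stated assumptions: the function name and the returned dictionary are hypothetical, ping-success handling is left out because the slide does not spell it out, and the actual DNS update is represented only by a flag.

```python
def on_ping_failure(failed_interface, other_interface_alive):
    """Decide the new state after a ping failure on 'P' or 'S'.

    Returns a dict with the new mode, whether the node is declared down,
    and a flag that DNS must be updated (always true per the slide).
    """
    if not other_interface_alive:
        # Both ports dead: node is down, state resets to primary.
        return {"mode": "P", "node_down": True, "update_dns": True}
    if failed_interface == "P":
        # Primary failed, secondary still answers: fail over.
        return {"mode": "S", "node_down": False, "update_dns": True}
    # Secondary failed, primary still answers: go back to primary.
    return {"mode": "P", "node_down": False, "update_dns": True}


print(on_ping_failure("P", other_interface_alive=True))
print(on_ping_failure("P", other_interface_alive=False))
```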

  14. Router / Hub failure • If the router/hub fails, invoke the primary failover script for each node connected to the primary side, and the secondary failover script for each node connected to the secondary side. • This is effectively all the nodes, so we don’t have to wait for each to have a ping failure. The system will stabilize faster.
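A possible shape for this bulk failover is sketched below. It assumes one reading of the slide: nodes whose primary LAN sits on the failed router/hub get the primary failover script, and all other nodes get the secondary failover script. The script names, node names, and the 'left'/'right' labels are illustrative only.

```python
def scripts_for_hub_failure(failed_hub, nodes):
    """Plan which failover script to run for every node.

    nodes maps a node name to the hub ('left' or 'right') that carries
    its primary LAN. Returns a list of (node, script_name) pairs.
    """
    plan = []
    for node, primary_hub in sorted(nodes.items()):
        if primary_hub == failed_hub:
            plan.append((node, "primary_failover"))
        else:
            plan.append((node, "secondary_failover"))
    return plan


# Hypothetical inventory: half the nodes use each hub as primary
nodes = {"server-a": "left", "server-b": "right"}
for node, script in scripts_for_hub_failure("left", nodes):
    print(node, script)
```

Running the plan for every node at once avoids waiting for each node's own ping failure to be detected.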

  15. Does it work? • You bet! It required: • Some special REXX scripts for failover • A few Basic programs • A hack to the auto discovery scripts • Some magic with Trix and a few more Basic programs.
