Disaster Recovery by Stretching Hyper-V Clusters Across Sites

Presentation Transcript


  1. SESSION CODE: VIR303. Disaster Recovery by Stretching Hyper-V Clusters Across Sites. Symon Perriman, Program Manager II, Clustering & High-Availability, Microsoft Corporation

  2. Session Objectives And Takeaways • Session Objective(s): • Understanding the need and benefit of multi-site clusters • What to consider as you plan, design, and deploy your first multi-site cluster • Windows Server Failover Clustering with Hyper-V is a great solution for not only high availability, but also disaster recovery

  3. Multi-Site Clustering: Introduction • Networking • Storage • Quorum

  4. Defining High-Availability • High-Availability (HA) allows applications or VMs to maintain service availability by moving them between nodes in a cluster • But what if there is a catastrophic event and you lose the entire datacenter? (Diagram: failover between cluster nodes within Site A)

  5. Defining Disaster Recovery • Disaster Recovery (DR) allows applications or VMs to maintain service availability by moving them to a cluster node in a different physical location • Node is located at a physically separate site (Diagram: VM failing over from Site A to Site B, each site with its own SAN)

  6. Benefits of a Multi-Site Cluster • Protects against loss of an entire location • Power Outage, Fires, Hurricanes, Floods, Earthquakes, Terrorism • Automates failover • Reduced downtime • Lower complexity disaster recovery plan • Reduces administrative overhead • Automatically synchronize application and cluster changes • Easier to keep consistent than standalone servers • What is the primary reason why DR solutions fail? • Dependence on People

  7. Flexible Hardware • Two simple requirements for support • All components must be logoed • http://www.microsoft.com/whdc/winlogo/default.mspx • Complete solution must pass the Cluster Validation Test • http://technet.microsoft.com/en-us/library/cc732035.aspx • Same Windows Server 2008 hardware will work • No reason not to move to R2! • CSV has the same storage requirements • iSCSI, Fibre Channel or Serial-Attached SCSI • Support Policy: KB 943984

  8. Multi-Site Clustering: Introduction • Networking • Storage • Quorum

  9. Stretching the Network • Longer distance traditionally means greater network latency • Missed inter-node health checks can cause false failover • Cluster heartbeating is fully configurable • SameSubnetDelay (default = 1 second) • Frequency heartbeats are sent • SameSubnetThreshold (default = 5 heartbeats) • Missed heartbeats before an interface is considered down • CrossSubnetDelay (default = 1 second) • Frequency heartbeats are sent to nodes on dissimilar subnets • CrossSubnetThreshold (default = 5 heartbeats) • Missed heartbeats before an interface is considered down to nodes on dissimilar subnets • Command Line: Cluster.exe /prop • PowerShell (R2): Get-Cluster | fl *
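As a sketch of how these heartbeat properties can be inspected and tuned with the R2 Failover Clustering PowerShell module (the delay and threshold values below are purely illustrative, not recommendations):

    Import-Module FailoverClusters
    # View the current heartbeat settings (delays are stored in milliseconds)
    Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
    # Example: tolerate higher WAN latency between sites
    (Get-Cluster).CrossSubnetDelay = 2000       # send cross-subnet heartbeats every 2 seconds
    (Get-Cluster).CrossSubnetThreshold = 10     # allow 10 missed heartbeats before the interface is considered down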

  10. Security over the WAN • Encrypt inter-node communication • 0 = clear text • 1 = signed (default) • 2 = encrypted (Diagram: cluster nodes at 10.10.10.1 and 20.20.20.1 in Site A and 30.30.30.1 and 40.40.40.1 in Site B communicating across the WAN)
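These values correspond to the cluster's SecurityLevel common property; a minimal PowerShell (R2) sketch:

    Import-Module FailoverClusters
    # 0 = clear text, 1 = signed (default), 2 = encrypted
    (Get-Cluster).SecurityLevel
    (Get-Cluster).SecurityLevel = 2   # encrypt inter-node communication over the WAN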

  11. Network Considerations • Network Deployment Options: • Stretch VLANs across sites • Cluster nodes can reside in different subnets (Diagram: public and redundant cluster networks spanning Site A, with nodes 10.10.10.1 and 20.20.20.1, and Site B, with nodes 30.30.30.1 and 40.40.40.1)

  12. DNS Considerations • Nodes in dissimilar subnets • VM obtains new IP address • Clients need that new IP address from DNS to reconnect (Diagram: the VM's record is created on DNS Server 1 as 10.10.10.111 in Site A, updated to 20.20.20.222 after failover to Site B, replicated between DNS servers, and then obtained by clients)

  13. Faster Failover for Multi-Subnet Clusters • RegisterAllProvidersIP (default = 0 for FALSE) • Determines if all IP Addresses for a Network Name will be registered by DNS • TRUE (1): IP Addresses can be online or offline and will still be registered • Ensure the application is set to try all IP Addresses, so clients can reconnect more quickly • HostRecordTTL (default = 1200 seconds) • Controls how long the DNS record for a cluster network name lives in the client's cache • Shorter TTL: client DNS records are updated sooner • Exchange Server 2007 recommends a value of five minutes (300 seconds)
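A hedged example of setting both parameters on a clustered network name resource with PowerShell ("VM Network Name" is a hypothetical resource name; the resource must be cycled for the change to take effect):

    Import-Module FailoverClusters
    # Register all of the network name's IP addresses in DNS
    Get-ClusterResource "VM Network Name" | Set-ClusterParameter RegisterAllProvidersIP 1
    # Shorten the DNS TTL so clients pick up a new address sooner (300 seconds = 5 minutes)
    Get-ClusterResource "VM Network Name" | Set-ClusterParameter HostRecordTTL 300
    # Take the resource offline and back online so the new parameters apply
    Stop-ClusterResource "VM Network Name"
    Start-ClusterResource "VM Network Name"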

  14. Solution #1: Local Failover First • Configure local failover first for high availability • No change in IP addresses • No DNS replication issues • No data going over the WAN • Cross-site failover for disaster recovery (Diagram: the VM keeps 10.10.10.111 while failing over within Site A, and only takes 20.20.20.222 after a cross-site failover to Site B)
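One way to express "local failover first" is through the preferred owner list of the VM's cluster group, which the cluster tries in order; a sketch assuming hypothetical group and node names:

    Import-Module FailoverClusters
    # List the local (Site A) nodes first so the VM fails over within
    # Site A before moving to the DR site
    Set-ClusterOwnerNode -Group "SQL VM" -Owners "SiteA-Node1","SiteA-Node2","SiteB-Node1"
    # Verify the resulting order
    Get-ClusterOwnerNode -Group "SQL VM"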

  15. Solution #2: Stretch VLANs • Deploying a VLAN minimizes client reconnection times • IP of the VM never changes (Diagram: a stretched VLAN spans Site A and Site B, so the VM keeps 10.10.10.111 after failover)

  16. Solution #3: Abstraction in Networking Device • Networking device uses an independent 3rd IP Address • 3rd IP Address is registered in DNS & used by clients (Diagram: clients connect to the device-owned address 30.30.30.30, which maps to 10.10.10.111 in Site A or 20.20.20.222 in Site B)

  17. Cluster Shared Volumes Networking Considerations • CSV does not support having nodes in dissimilar subnets • Use VLANs if you want to use CSV with multi-site clusters • Note: CSV and live migration are independent, but complementary, technologies (Diagram: a stretched VLAN carrying the CSV network between Site A and Site B)
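On R2 you can see, and if needed influence, which cluster network CSV traffic prefers by looking at the network metrics; a sketch, assuming a network named "CSV Network" and an illustrative metric value:

    Import-Module FailoverClusters
    # CSV communication prefers the cluster network with the lowest metric
    Get-ClusterNetwork | Format-Table Name, Metric, AutoMetric
    # Example: give the dedicated CSV network the lowest metric
    (Get-ClusterNetwork "CSV Network").Metric = 900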

  18. Updating the VM's IP Address on Cross-Subnet Failover • On cross-subnet failover, if the guest is… • Best to use DHCP in the guest OS for cross-subnet failover

  19. Live Migrating Across Sites • Live migration moves a running VM between cluster nodes • A TCP reconnect makes the move unnoticeable to clients • Use VLANs to achieve live migrations between sites • The IP address the client is connected to will not change • Network Bandwidth Planning • Live migration may require significant network bandwidth based on the amount of memory allocated to the VM • Live migration times will be longer over high-latency or low-bandwidth WAN connections
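A cross-site live migration can be started with the R2 cmdlet below (the VM group and node names are hypothetical); the bandwidth and latency caveats above still apply:

    Import-Module FailoverClusters
    # Live migrate the clustered VM to a node in the other site
    Move-ClusterVirtualMachineRole -Name "SQL VM" -Node "SiteB-Node1" -MigrationType Live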

  20. Multi-Subnet vs. VLAN Recap • Choosing the right networking model for you depends on your business requirements

  21. Multi-Site Clustering: Introduction • Networking • Storage • Quorum

  22. Storage in Multi-Site Clusters • Different from local clusters: • Multiple storage arrays – independent per site • Nodes commonly access their own site's storage • No 'true' shared disk visible to all nodes (Diagram: Site A and Site B each with its own SAN)

  23. Storage Considerations • DR requires a data replication mechanism between sites • Changes are made on Site A and replicated to Site B (Diagram: Site A's SAN holds the primary copy and Site B's SAN holds the replica)

  24. Replication Partners • Hardware storage-based replication • Block-level replication • Software host-based replication • File-level replication • Appliance replication • File-level replication

  25. Synchronous Replication • Host receives the “write complete” response from the storage after the data is successfully written on both storage devices (Diagram: the write request goes to primary storage, is replicated to secondary storage and acknowledged, and only then is “write complete” returned to the host)

  26. Asynchronous Replication • Host receives the “write complete” response from the storage after the data is successfully written to just the primary storage device; replication to the secondary happens afterward (Diagram: the write request and “write complete” are exchanged with primary storage first, then the data is replicated to secondary storage)

  27. Synchronous versus Asynchronous

  28. Cluster Validation with Replicated Storage • Multi-Site clusters are not required to pass the Storage tests to be supported • Validation Guide and Policy • http://go.microsoft.com/fwlink/?LinkID=119949
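For a multi-site cluster with replicated storage, validation can be run while skipping the Storage test category; a sketch with hypothetical node names:

    Import-Module FailoverClusters
    # Run cluster validation but skip the Storage tests, which replicated
    # multi-site storage is not required to pass
    Test-Cluster -Node "SiteA-Node1","SiteA-Node2","SiteB-Node1" -Ignore Storage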

  29. What about DFS-Replication? • Not supported to use the file server DFS-R feature to replicate VM data on a multi-site Failover Cluster • DFS-R performs replication on file close: • Works well for Office documents • Not designed for application workloads where the file is held open, like VHDs or databases

  30. Cluster Shared Volume Overview • Cluster Shared Volumes (CSV) • Distributed file access solution for Hyper-V • Enables multiple nodes to concurrently access a single 'truly' shared volume • Provides VMs complete transparency with respect to which node actually owns a LUN • Guest VMs can be moved without requiring any disk ownership changes • No dismounting and remounting of volumes is required (Diagram: several nodes with concurrent access to a single file system on one SAN volume, Disk5, holding multiple VHDs)
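Once CSV is enabled on the cluster, an existing clustered disk can be added to Cluster Shared Volumes with the R2 cmdlets below ("Cluster Disk 5" is a hypothetical disk resource name):

    Import-Module FailoverClusters
    # Add an existing clustered disk to Cluster Shared Volumes
    Add-ClusterSharedVolume -Name "Cluster Disk 5"
    # List the CSVs; each appears under C:\ClusterStorage on every node
    Get-ClusterSharedVolume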

  31. CSV with Replicated Storage • Traditional architectural assumptions do not hold true • Traditional replication solutions assume only 1 array is accessed at a time • CSV assumes all nodes can concurrently access a LUN • CSV is supported by many replication vendors • Talk to your storage vendor to understand their support story (Diagram: a VM in Site B attempts to access the replica VHD, which is read-only while Site A's copy is read/write)

  32. Storage Virtualization Abstraction • Some replication solutions provide complete abstraction in the storage array • Servers are unaware of the accessible disk location • Fully compatible with Cluster Shared Volumes (CSV) (Diagram: servers in Site A and Site B are abstracted from storage; virtualized storage presents a single logical LUN)

  33. Choosing a Multi-Site Storage Model • Choosing the right storage model for you depends on your business requirements

  34. PARTNER: EMC for Windows Server Failover Clustering • Txomin Barturen, Senior Manager, Symmetrix and Virtualization, EMC Corporation

  35. What’s Storage Got To Do With It? • Storage controllers can be powerful compute and replication resources • Provide multiple replication styles • Synchronous – Metro configurations • Asynchronous – Continental configurations… and various combinations of those • Arrays/appliances can add consistency technology to replication • Bind database and transaction logs together as an atomic unit • Required for Disaster Recovery scenarios • Single consolidated solution for all environments • As opposed to per-application solutions • Operational ease and automated operations

  36. Geographical Windows Clustering • Long history of Geographical Windows solutions • Original “GeoSpan” introduced in the 1990s • Current product is called “Cluster Enabler” • Support for multiple storage replication mechanisms • Symmetrix Remote Data Facility (SRDF) • CLARiiON Mirrorview • EMC RecoverPoint (Appliance) • Support for multiple replication implementations • Synchronous (SRDF/S, MV/S, RP) • Asynchronous (SRDF/A, MV/A, RP) • Select the best replication fit for SLA

  37. Cluster Enabler – Integration with Failover Clustering • Cluster Enabler is implemented as a cluster group resource • DLL manages disk state when necessary • Disaster or site move requests • Custom MMC for administration • Provides insight into relationships • Allows for management of storage resources • Add/remove storage devices • All cluster functions managed through Failover Cluster Manager • Simplified management

  38. Unique Cluster Configuration Support • Cascaded Replication • Concurrent Replication • Heterogeneous Replication

  39. Challenges of Block Storage Replication • Storage block level replication is typically uni-directional (per LUN) • Change blocks flow from source site to remote • Possible to have different LUNs replicating in different directions • Storage cannot enforce block level collision resolution • Application must determine resolution, or be coordinated • Applications today implement shared nothing model • Surfacing storage as R/W at multiple sites is only useful if application can handle a distributed access device • Few applications implement the necessary support • Obvious exception is CSV

  40. ANNOUNCING: EMC VPLEX METRO support for Hyper-V and Cluster Shared Volumes

  41. Federated Storage Infrastructure • Federated storage • A new HW and SW platform that extends storage beyond the boundaries of the data center • Located in the SAN to present hosts with federated view of EMC and heterogeneous storage • VPLEX Local and VPLEX Metro configurations • Unique Value • Distributed coherent cache – AccessAnywhere™ • N+1 scale out cluster • Data at a Distance • Architected for Global Apps • Workload “travels” with application

  42. Sample VPLEX METRO Configuration (Diagram: nodes NewYork-01 through NewYork-04 on VPLEX Cluster-1 and NewJersey-01 through NewJersey-04 on VPLEX Cluster-2, sharing CSV Volume1 through Volume4, each holding SQL VHDs and OS VHDs)

  43. DEMO: EMC VPLEX Metro with Cluster Shared Volumes

  44. Multi-Site Clustering: Introduction • Networking • Storage • Quorum

  45. Quorum Overview • Majority is greater than 50% • Possible voters: nodes (1 vote each) + 1 witness (Disk or File Share) • 4 quorum types: • Node Majority • Node and Disk Majority • Node and File Share Majority • Disk Only (not recommended) (Diagram: five voting cluster nodes)
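The current quorum model and witness can be checked with PowerShell (R2):

    Import-Module FailoverClusters
    # Shows the quorum type and the witness resource (if any)
    Get-ClusterQuorum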

  46. Replicated Disk Witness • A witness is a tie breaker when nodes lose network connectivity • The witness disk must be a single decision maker, or problems can occur • Do not use a Disk Witness in multi-site clusters unless directed by your vendor (Diagram: three voting nodes plus a witness disk on replicated storage, whose vote is ambiguous)

  47. Node Majority: Majority in Primary Site • 5-node cluster: majority = 3 • Cross-site network connectivity broken! • Each node asks: can I communicate with a majority of the nodes in the cluster? • Site A nodes (the majority): yes, so they stay up • Site B nodes: no, so they drop out of cluster membership

  48. Node Majority: Disaster at the Primary Site • 5-node cluster: majority = 3 • Disaster at the primary site, which held the majority of nodes: “We are down!” • The surviving Site B nodes cannot communicate with a majority of the nodes in the cluster, so they drop out of cluster membership • Need to force quorum manually

  49. Forcing Quorum • Forcing quorum is a way to manually override and start a node even if the cluster does not have quorum • Important: understand why quorum was lost • Cluster starts in a special “forced” state • Once majority achieved, drops out of “forced” state • Command Line: • net start clussvc /fixquorum (or /fq) • PowerShell (R2): • Start-ClusterNode –FixQuorum (or –fq)

  50. Multi-Site with File Share Witness • Place the File Share Witness (\\Foo\Share) at a third site, Site C (branch office) • Complete resiliency and automatic recovery from the loss of any 1 site (Diagram: nodes in Site A and Site B connect over the WAN to the file share witness in Site C)
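Configuring Node and File Share Majority with the witness share at the third site might look like this, using the \\Foo\Share example from the slide:

    Import-Module FailoverClusters
    # Use Node and File Share Majority with the witness hosted at Site C
    Set-ClusterQuorum -NodeAndFileShareMajority "\\Foo\Share"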
