
SecondSite: Disaster Tolerance as a Service


Presentation Transcript


  1. SecondSite: Disaster Tolerance as a Service
     Shriram Rajagopalan, Brendan Cully, Ryan O’Connor, Andrew Warfield

  2. Failures in a Datacenter

  3. Tolerating Failures in a Datacenter: Remus
     The initial idea behind Remus was to tolerate datacenter-level failures.

  4. Can a Whole Datacenter Fail? Yes! It’s a “Disaster”!

  5. Disasters
     “Truck driver in Texas kills all the websites you really use” …Southlake FD found that he had low blood sugar. (valleywag.com)
     “Our Internet infrastructure, despite all the talk, is as fragile as a fine porcelain cup on the roof of a car zipping across a pot-holed goat track. A single truck driver can take out sites like 37Signals in a snap.” (Om Malik, GigaOM)
     Illustrative image courtesy of TangoPango, Flickr.

  6. Disasters..
     Water-main break cripples Dallas County computers, operations. “The county's criminal justice system nearly ground to a halt, as paper processing from another era led to lengthy delays, keeping some prisoners in jail longer than normal.” (Dallas Morning News, Jun 2010)

  7. Disasters..

  8. More Fodder Back Home
     “An explosion… near our server bank … electrical box containing 580 fiber cables. The electrical box … was covered in asbestos… mandated the wearing of hazmat suits .... Worse yet, the dynamic rerouting, which is the hallmark of the internet, … did not function. In other words, the perfect storm. Oh well. S*it happens.”
     (Dan Empfield, Slowswitch.com, a Gossamer Threads customer)

  9. Disaster Recovery – The Old Fashioned Way
     • Storage replication between a primary and backup site.
     • Manually restore physical servers from backup images.
     • Data loss and long outage periods.
     • Expensive hardware: storage arrays, replicators, etc.

  10. State of the Art Disaster Recovery: Array Replication
      [Diagram: a Protected Site and a Recovery Site, each running Site Recovery Manager and VirtualCenter. VMs online at the Protected Site become unavailable on failure; the offline VMs at the Recovery Site are powered on. Datastore groups are replicated between the sites.]
      Source: VMware Site Recovery Manager – Technical Overview

  11. Problems with Existing Solutions
      • Data loss & service disruption (RPO ~15 min, RTO ~a few hours)
      • Complicated recovery planning (e.g. service A needs to be up before B, etc.)
      • Application-level recovery
      Bottom line: the current state of DR is complicated, expensive, and not suitable for a general-purpose cloud-level offering.

  12. Disaster Tolerance as a Service? Our Vision

  13. Overview • A Case for Commoditizing Disaster Tolerance • SecondSite – System Design • Evaluation & Experiences

  14. Primary & Backup Sites (5 ms RTT)

  15. Failover & Failback without Outage
      [Animation: initially Primary Site Vancouver, Backup Site Kamloops; after failover and failback the roles are swapped, with Primary Site Kamloops and Backup Site Vancouver.]
      • Complete state recovery (CPU, disk, memory, network)
      • No application-level recovery

  16. Main Contributions
      • Remus (NSDI ’08): checkpoint-based state replication; fully transparent HA; recovery consistency; no application-level recovery.
      • RemusDB (VLDB ’11): optimizes server latency; reduces replication bandwidth by up to 80% using page delta compression and disk read tracking.
      • SecondSite (VEE ’12): failover arbitration in the wide area; stateful network failover over the wide area.
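The page delta compression mentioned above sends only the bytes of a dirty page that actually changed since the last checkpoint. A minimal byte-level sketch of the idea (illustrative only; the real RemusDB diffs 4 KB guest pages inside the checkpointing pipeline):

```python
def page_delta(old: bytes, new: bytes):
    """Encode a dirty page as a list of (offset, changed_bytes) runs."""
    assert len(old) == len(new)
    runs, i, n = [], 0, len(new)
    while i < n:
        if old[i] != new[i]:
            j = i
            while j < n and old[j] != new[j]:
                j += 1                      # extend the run of changed bytes
            runs.append((i, new[i:j]))
            i = j
        else:
            i += 1
    return runs

def apply_delta(old: bytes, runs):
    """Rebuild the new page on the backup from the old page plus the runs."""
    page = bytearray(old)
    for off, data in runs:
        page[off:off + len(data)] = data
    return bytes(page)
```

If only a few bytes of a page change between checkpoints, the runs are far smaller than the full page, which is where the bandwidth reduction comes from.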

  17. Contributions..

  18. Failure Detection in Remus
      • A pair of independent, dedicated NICs carries the replication traffic.
      • The backup declares primary failure only if it cannot reach the primary via NIC1 and NIC2, and it can reach the external network via NIC1.
      • Failure of the replication link alone results in backup shutdown.
      • Split brain occurs only when both NICs/links fail.
      [Diagram: primary and backup hosts on a LAN; checkpoints flow over NIC1 and NIC2, with an uplink to the external network.]
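The detection rule above can be sketched as a small decision function. This is one hedged reading of the slide, not actual Remus code, and the function name and return values are illustrative:

```python
def backup_action(primary_via_nic1: bool,
                  primary_via_nic2: bool,
                  external_via_nic1: bool) -> str:
    """What the backup does, per the LAN failure-detection rule (sketch)."""
    if primary_via_nic1 or primary_via_nic2:
        # Primary still reachable on at least one link: keep replicating.
        return "replicate"
    if external_via_nic1:
        # Primary unreachable on both NICs but the outside world is up:
        # declare primary failure and take over.
        return "failover"
    # Both links down and no external reachability: the backup cannot
    # tell a link failure from a primary failure, so it shuts down
    # rather than risk split brain.
    return "shutdown"
```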

  19. Failure Detection in Wide Area Deployments
      • Cannot distinguish between link and node failure.
      • Higher chance of split brain, as the network is no longer reliable.
      [Diagram: primary and backup datacenters connected over the Internet/WAN; the replication channel carries checkpoints between the NICs at each site.]

  20. Failover Arbitration
      • Local quorum of simple reachability detectors.
      • Stewards can be placed on third-party clouds.
      • Google App Server implementation with ~100 LoC.
      • The provider/user could have other, more sophisticated implementations.

  21. Failover Arbitration..
      [Diagram: primary and backup each poll the five stewards, agreed upon a priori; some polls fail (marked X). Each site runs its quorum logic over the poll results: the primary “needs a majority to stay alive”, while the backup “needs an exclusive majority to fail over”. The replication stream connects the two sites.]
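The quorum rules on this slide can be sketched as follows. This is an illustrative model, not the SecondSite implementation; the exclusivity assumption is that each steward's vote is held by at most one site at a time, so a majority at the backup implies the primary cannot simultaneously hold one:

```python
def majority(votes: int, total_stewards: int) -> bool:
    """Strict majority of the steward set."""
    return votes > total_stewards // 2

def primary_stays_alive(stewards_reachable: int, total_stewards: int) -> bool:
    # The primary keeps serving while it can reach a majority of stewards.
    return majority(stewards_reachable, total_stewards)

def backup_may_failover(exclusive_votes: int, total_stewards: int) -> bool:
    # The backup fails over only with an *exclusive* majority of votes,
    # which rules out both sites acting as primary at once.
    return majority(exclusive_votes, total_stewards)
```

With five stewards, a site needs at least three votes; a 2–2 partition (one steward unreachable from both sites) leaves the primary down and the backup unable to fail over, which is the safe outcome.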

  22. Network Failover without Service Interruption
      • Remus (LAN): gratuitous ARP from the backup host.
      • SecondSite (WAN/Internet): BGP route update from the backup datacenter.
        • Needs support from upstream ISP(s) at both datacenters.
        • IP migration achieved through BGP multi-homing.

  23. Network Failover without Service Interruption..
      [Diagram: both sites connect to BCNet (AS-271) and announce the same stub prefix, AS-64678 (134.87.3.0/24). The primary site (Vancouver, 134.87.2.173/.174) announces the prefix normally; the backup site (Kamloops, 207.23.255.237/.238) announces it with its AS path prepended (64678 repeated), so traffic routes to the primary and is re-routed to the backup on failover.]
      • BGP multi-homing
      • Replication
      • Routing traffic to the primary site
      • Re-routing traffic to the backup site on failover
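Why AS-path prepending steers traffic: BGP prefers the route with the shortest AS path, so the backup's prepended announcement loses to the primary's until the primary's announcement disappears. A simplified sketch using the AS numbers from the slide (this models only the path-length tie-break, not the full BGP decision process):

```python
def best_route(announcements):
    """Pick the announcement with the shortest AS path (simplified BGP)."""
    return min(announcements, key=lambda a: len(a["as_path"]))

# Both sites announce 134.87.3.0/24; the backup prepends its own ASN.
primary = {"site": "Vancouver", "as_path": [271, 64678]}
backup  = {"site": "Kamloops",  "as_path": [271, 64678, 64678, 64678, 64678]}

# Normal operation: the primary's shorter path wins.
assert best_route([primary, backup])["site"] == "Vancouver"

# On failover the primary's announcement is withdrawn; the prepended
# route is now the only (hence best) path, so traffic shifts sites.
assert best_route([backup])["site"] == "Kamloops"
```

Because both routes are always advertised, failover needs only a withdrawal to propagate, not a new announcement, which keeps the interruption short.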

  24. Overview • A Case for Commoditizing Disaster Tolerance • SecondSite – System Design • Evaluation & Experiences

  25. Evaluation
      Failover works!!
      “I want periodic failovers with no downtime!”
      “More than one failure? I will have to restart HA!”
      “Did you run regression tests?”

  26. Restarting HA
      • Need to resynchronize storage.
      • Avoiding service downtime requires online resynchronization.
      • Leverage DRBD, which only resynchronizes blocks that have changed.
      • Integrate DRBD with Remus: add a checkpoint-based asynchronous disk replication protocol.
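The idea behind DRBD-style online resynchronization, tracking writes in a dirty bitmap so that only blocks changed during the outage are copied, can be sketched as follows (illustrative only; this is not DRBD's actual bitmap format or protocol):

```python
class DirtyBitmapDisk:
    """A toy disk that tracks dirty blocks for selective resync."""
    def __init__(self, num_blocks: int, block_size: int = 4096):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.dirty = set()           # block numbers written since last sync

    def write(self, block_no: int, data: bytes):
        self.blocks[block_no] = data
        self.dirty.add(block_no)     # mark in the dirty "bitmap"

    def resync_to(self, peer) -> int:
        """Online resync: ship only the dirty blocks to the peer.
        Returns the number of blocks transferred."""
        sent = 0
        for b in sorted(self.dirty):
            peer.blocks[b] = self.blocks[b]
            sent += 1
        self.dirty.clear()
        return sent
```

Resync cost is proportional to the write activity during the outage rather than to disk size, which is why HA can be restarted without a long service interruption.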

  27. Regression Tests
      • Synthetic workloads to stress-test the replication pipeline.
      • Failovers every 90 minutes.
      • Discovered some interesting corner cases:
        • Page-table corruptions in memory checkpoints
        • Write-after-write I/O ordering in disk replication

  28. SecondSite – The Complete Picture
      [Graph: 4 VMs x 100 clients/VM]
      • Service downtime includes the timeout for failure detection (10 s).
      • The failure-detection timeout is configurable.

  29. Replication Bandwidth Consumption
      [Graph: 4 VMs x 100 clients/VM]

  30. Demo
      • Expect a real disaster (conference demos are not a good idea!)

  31. Application Throughput vs. Replication Latency
      [Graph: SPECweb with 100 clients, replicating to Kamloops]

  32. Resource Utilization vs. Application Load
      [Graphs: Domain-0 CPU utilization and bandwidth usage on the replication channel; the cost of HA as a function of application load (OLTP with 100 clients)]

  33. Resynchronization Delays vs. Outage Period
      [Graph: OLTP workload]

  34. Setup Workflow – Recovery Site
      • The user creates a recovery plan, which is associated with one or more protection groups.
      Source: VMware Site Recovery Manager – Technical Overview

  35. Recovery Plan
      [Flowchart: High-priority VM shutdown → VM shutdown → Prepare storage → High-priority VM recovery → Normal-priority VM recovery → Low-priority VM recovery]
      Source: VMware Site Recovery Manager – Technical Overview
