
Towards Reliable Cloud Storage


Presentation Transcript


  1. Towards Reliable Cloud Storage Haryadi Gunawi UC Berkeley

  2. Outline • Research background • FATE and DESTINI

  3. Yesterday’s storage Storage Servers Local storage

  4. Tomorrow’s storage e.g. Google Laptop Cloud Storage

  5. Tomorrow’s storage • Internet-service FS: GoogleFS, HadoopFS, CloudStore, ... • Key-Value Store: Cassandra, Voldemort, ... • Structured Storage: Yahoo! PNUTS, Google BigTable, HBase, ... • Custom Storage: Facebook Haystack Photo Store, Microsoft StarTrack (map apps), Amazon S3, EBS, ... • + Replication + Scale-up + Migration + ... Reliable? Highly-available? • “This is not just data. It’s my life. And I would be sick if I lost it” [CNN ’10]

  6. Headlines Cloudy with a chance of failure

  7. Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Evaluation Joint work with: Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph Hellerstein, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Koushik Sen, Dhruba Borthakur

  8. Failure recovery ... • Cloud • Thousands of commodity machines • “Rare (HW) failures become frequent” [Hamilton] • Failure recovery • “… has to come from the software” [Dean] • “… must be a first-class op” [Ramakrishnan et al.]

  9. ... hard to get right • Google Chubby (lock service) • Four occasions of data loss • Whole-system down • More problems after injecting multiple failures (randomly) • Google BigTable (key-value store) • Chubby bug affects BigTable availability • More details? • How often? Other problems and implications?

  10. ... hard to get right • HDFS JIRA study • 1300 issues, 4 years (April 2006 to July 2010) • Selected recovery problems caused by hardware failures • 91 recovery bugs/issues • Open-source cloud projects • Very valuable bug/issue repository! • HDFS (1400+), ZooKeeper (900+), Cassandra (1600+), HBase (3000+), Hadoop (7000+)

  11. Why? • Testing is not advanced enough [Google] • Failure model: multiple, diverse failures • Recovery is under-specified [Hamilton] • Lots of custom recovery • Implementation is complex • Need two advancements: • Exercise complex failure modes • Write specifications and test the implementation

  12. Cloud testing • FATE (Failure Testing Service): injects failures (X1, X2) into the cloud software • DESTINI (Declarative Testing Specifications): checks whether the resulting behavior violates the specs

  13. Outline • Research background • FATE and DESTINI • Motivation • FATE • Architecture • Failure exploration • DESTINI • Evaluation

  14. HadoopFS (HDFS) Write Protocol • Setup stage: client C sends an Alloc Req to master M and gets back a pipeline of datanodes (1, 2, 3) • Data transfer stage: data streams through the pipeline • No failures: the write goes through nodes 1, 2, 3 • Data Transfer Recovery: continue on surviving nodes (1, 2) • Setup Recovery: recreate a fresh pipeline (1, 2, 4)
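A minimal sketch of the two recovery paths above, assuming a single crashed pipeline node; the class, method, and node names here are illustrative only, not HDFS's actual pipeline code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two recovery paths: setup recovery recreates a
// fresh pipeline with a replacement node, data-transfer recovery simply
// continues on the surviving nodes.
public class WritePipelineRecoverySketch {
    enum Stage { SETUP, DATA_TRANSFER }

    static List<String> recover(Stage stage, List<String> pipeline,
                                String crashedNode, String replacementFromMaster) {
        List<String> survivors = new ArrayList<>(pipeline);
        survivors.remove(crashedNode);
        if (stage == Stage.SETUP) {
            // Setup recovery: recreate a fresh pipeline, e.g. (1, 2, 4)
            survivors.add(replacementFromMaster);
        }
        // Data-transfer recovery: continue on the surviving nodes, e.g. (1, 2)
        return survivors;
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("Node1", "Node2", "Node3");
        System.out.println(recover(Stage.DATA_TRANSFER, pipeline, "Node3", "Node4")); // [Node1, Node2]
        System.out.println(recover(Stage.SETUP, pipeline, "Node3", "Node4"));         // [Node1, Node2, Node4]
    }
}
```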

  15. Failures and FATE • Failures • Anytime: different stages → different recovery • Anywhere: N2 crash, and then N3 • Any type: bad disks, partitioned nodes/racks • FATE • Systematically exercise multiple, diverse failures • How? Need to “remember” failures – via failure IDs

  16. Failure IDs • Abstraction of I/O failures • Building failure IDs • Intercept every I/O • Inject possible failures • Ex: crash, network partition, disk failure (LSE/corruption) • Note: FIDs are labeled A, B, C, ...
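A rough way to picture a failure ID: a small value object recording where an I/O is intercepted and which failure is injected there. The class and field names below are hypothetical, not FATE's actual types:

```java
import java.util.Objects;

// Hypothetical sketch of a failure ID (FID): an abstraction of one possible
// failure at one intercepted I/O point. Names are illustrative, not FATE's.
public final class FailureId {
    public enum FailureType { CRASH, NETWORK_PARTITION, DISK_FAILURE }

    private final String node;       // which node performs the I/O, e.g. "Node2"
    private final String ioPoint;    // which intercepted I/O call, e.g. "dataTransfer.write"
    private final FailureType type;  // which failure to inject at that point

    public FailureId(String node, String ioPoint, FailureType type) {
        this.node = node;
        this.ioPoint = ioPoint;
        this.type = type;
    }

    // Value equality lets the tester "remember" which failures were already exercised.
    @Override public boolean equals(Object o) {
        if (!(o instanceof FailureId)) return false;
        FailureId f = (FailureId) o;
        return node.equals(f.node) && ioPoint.equals(f.ioPoint) && type == f.type;
    }
    @Override public int hashCode() { return Objects.hash(node, ioPoint, type); }
    @Override public String toString() { return node + "/" + ioPoint + "/" + type; }
}
```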

  17. FATE Architecture • Workload Driver: while (new FIDs) { hdfs.write() } • Target system (e.g. Hadoop FS) with a failure surface woven in via AspectJ on top of the Java SDK • Failure Server: receives I/O info and decides fail or no fail?
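The driver loop on this slide, while (new FIDs) { hdfs.write() }, can be read roughly as the sketch below; the pending-FID queue, the Workload interface, and the failure-server helpers are assumed stand-ins, not FATE's real API:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the workload driver loop: keep re-running the
// workload as long as there are unexplored failure IDs to inject.
public class WorkloadDriver {

    interface Workload { void run() throws Exception; }   // e.g. wraps hdfs.write()

    private final Deque<String> pendingFailureIds = new ArrayDeque<>();

    public void drive(Workload workload) throws Exception {
        // Seed with a failure-free run to discover the initial failure IDs.
        pendingFailureIds.add("NO_FAILURE");
        while (!pendingFailureIds.isEmpty()) {
            String fid = pendingFailureIds.poll();
            injectOnNextMatch(fid);   // assumed: tell the failure server what to inject
            workload.run();           // intercepted I/O asks the server: fail or not?
            pendingFailureIds.addAll(newlyObservedFailureIds());
        }
    }

    private void injectOnNextMatch(String fid) { /* assumed: configure failure server */ }

    private List<String> newlyObservedFailureIds() {
        return List.of();             // assumed: fetch new FIDs recorded during the run
    }
}
```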

  18. Brute-force exploration • 1 failure / run: Exp #1: A, Exp #2: B, Exp #3: C • 2 failures / run: Exp #1: AB, Exp #2: AC, Exp #3: BC

  19. Outline • Research background • FATE and DESTINI • Motivation • FATE • Architecture • Failure exploration challenge and solution • DESTINI • Evaluation

  20. Combinatorial explosion • Exercised over 40,000 unique combinations of 1, 2, and 3 failures per run • 80 hours of testing time! • New challenge: combinatorial explosion • Need smarter exploration strategies • Ex: 2 failures / run over failure IDs A1, B1, A2, B2, A3, B3 (on nodes 1, 2, 3): A1A2, A1B2, B1A2, B1B2, ...

  21. Pruning multiple failures • Properties of multiple failures • Pairwise dependent failures • Pairwise independent failures • Goal: exercise distinct recovery behaviors • Key: some failures result in similar recovery • Result: > 10x faster, and found the same bugs

  22. Dependent failures • Failure dependency graph • Inject single failures first • Record subsequent dependent IDs • Ex: X depends on A • Dependency table: A → X; B → X; C → X, Y; D → X, Y • Brute-force: AX, BX, CX, DX, CY, DY • Recovery clustering • Two clusters: {X} and {X, Y} • Only exercise distinct clusters • Pick a failure ID that triggers the recovery cluster • Result: AX, CX, CY
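A toy sketch of this recovery clustering (hypothetical Java, not FATE's implementation) reproduces the example above: clustering A, B, C, D by the subsequent failure IDs they expose yields the pruned pairs AX, CX, CY:

```java
import java.util.*;

// Minimal sketch of dependent-failure pruning: cluster first failures by the
// set of subsequent failure IDs they expose, keep one representative per
// cluster, and only pair that representative with the IDs in its cluster.
public class DependentFailurePruning {

    static List<String> prunedPairs(Map<String, Set<String>> subsequent) {
        // Cluster key = the set of subsequent IDs (the "recovery cluster").
        Map<Set<String>, String> representative = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : subsequent.entrySet()) {
            representative.putIfAbsent(e.getValue(), e.getKey());
        }
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<Set<String>, String> e : representative.entrySet()) {
            for (String second : e.getKey()) {
                pairs.add(e.getValue() + second);   // e.g. "AX"
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Dependency graph from the slide: A -> {X}, B -> {X}, C -> {X, Y}, D -> {X, Y}
        Map<String, Set<String>> dep = new LinkedHashMap<>();
        dep.put("A", new LinkedHashSet<>(List.of("X")));
        dep.put("B", new LinkedHashSet<>(List.of("X")));
        dep.put("C", new LinkedHashSet<>(List.of("X", "Y")));
        dep.put("D", new LinkedHashSet<>(List.of("X", "Y")));
        // Prints [AX, CX, CY] instead of the brute-force AX, BX, CX, DX, CY, DY.
        System.out.println(prunedPairs(dep));
    }
}
```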

  23. Independent failures (1) • Independent combinations • FP² x N (N – 1) • FP = 2, N = 3, total = 4 x 6 = 24 • Symmetric code • Just pick two nodes • N (N – 1) → 2 • FP² x 2 = 8

  24. Independent failures (2) • FP² bottleneck • FP = 4, total = 16 • Real example: FP = 15 • Recovery clustering • Cluster failure points A and B if: fail(A) == fail(B) • Reduce FP² to (clustered FP)² • E.g. 15 FPs down to 8 clustered FPs
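For the independent case, a similarly hedged sketch (hypothetical names and behavior signatures, not FATE's API) clusters failure points whose injected failures behave identically, shrinking the FP² combinations to (clustered FP)²:

```java
import java.util.*;
import java.util.function.Function;

// Minimal sketch of clustering independent failure points: two points land in
// the same cluster if injecting them produces the same failure behavior,
// i.e. fail(A) == fail(B). Only cluster representatives are then combined.
public class IndependentFailureClustering {

    // "behavior" stands in for fail(p): some signature of what failing point p does.
    static List<String> clusterRepresentatives(List<String> failurePoints,
                                               Function<String, String> behavior) {
        Map<String, String> repByBehavior = new LinkedHashMap<>();
        for (String p : failurePoints) {
            repByBehavior.putIfAbsent(behavior.apply(p), p);
        }
        return new ArrayList<>(repByBehavior.values());
    }

    public static void main(String[] args) {
        List<String> fps = List.of("A", "B", "C", "D");
        // Assume A and B behave identically when failed, as do C and D.
        Function<String, String> fail =
            p -> (p.equals("A") || p.equals("B")) ? "drop-packet" : "bad-ack";
        List<String> reps = clusterRepresentatives(fps, fail);
        System.out.println(reps);                       // [A, C]
        System.out.println(reps.size() * reps.size());  // 4 combinations instead of 16
    }
}
```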

  25. FATE Summary • Exercise multiple, diverse failures • Via failure IDs abstraction • Challenge: combinatorial explosion of multiple failures • Smart failure exploration strategies • > 10x improvement • Built on top of failure IDs abstraction • Current limitations • No I/O reordering • Manual workload setup

  26. Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Overview • Building specifications • Evaluation

  27. DESTINI: declarative specs • Is the system correct under failures? • Need to write specifications (checked against the implementation) • “[It is] great to document (in a spec) the HDFS write protocol ..., but we shouldn't spend too much time on it, ... a formal spec may be overkill for a protocol we plan to deprecate imminently.”

  28. Declarative specs • How to write specifications? • Developer friendly (clear, concise, easy) • Existing approaches: • Unit test (ugly, bloated, not formal) • Others too verbose and long • Declarative relational logic language (Datalog) • Key: easy to express logical relations

  29. Specs in DESTINI • How to write specs? • Violations • Expectations • Facts • How to write recovery specs? • “... recovery is under specified” [Hamilton] • Precise failure events • Precise check timings • How to test the implementation? • Interpose I/O calls (lightweight) • Deduce expectations and facts from I/O events

  30. Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Overview • Building specifications • Evaluation

  31. Specification template • “Throw a violation if an expectation is different from the actual behavior” • violationTable(…) :- expectationTable(…), NOT-IN actualTable(…); • Datalog syntax: head() :- predicates(), … where “:-” means derivation and “,” means AND

  32. Data transfer recovery • “Replicas should exist in surviving nodes” • incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

  33. Recovery bug • The spec catches a violation of data-transfer recovery: • incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

  34. Building expectations • Ex: which nodes should have the blocks? • Deduce expectations from I/O events • Client asks master via getBlockPipe(…): “Give me 3 nodes for B” → [Node1, Node2, Node3] • #2: expectedNodes(B, N) :- getBlockPipe(B, N); • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

  35. Updating expectations • DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N); • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); • #2: expectedNodes(B, N) :- getBlockPipe(B, N);

  36. Precise failure events • DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “Data Transfer”; • Different stages → different recovery behaviors • Precise failure events → data-transfer recovery vs. setup-stage recovery • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); • #2: expectedNodes(B, N) :- getBlockPipe(B, N); • #3: DEL expectedNodes(B, N) :- expectedNodes(B, N), fateCrashNode(N), writeStage(B, Stage), Stage == “Data Transfer”; • #4: writeStage(B, “Data Transfer”) :- writeStage(B, “Setup”), nodesCnt(Nc), acksCnt(Ac), Nc == Ac; • #5: nodesCnt(B, CNT<N>) :- pipeNodes(B, N); • #6: pipeNodes(B, N) :- getBlockPipe(B, N); • #7: acksCnt(B, CNT<A>) :- setupAcks(B, P, “OK”); • #8: setupAcks(B, P, A) :- setupAck(B, P, A);

  37. Facts • Ex: which nodes actually store the valid blocks? • actualNodes is deduced from disk I/O events in 8 Datalog rules • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N);

  38. Violation and check-timing • Recovery ≠ invariant • If recovery is ongoing, invariants are violated • Don’t want false alarms • Need precise check timings • Ex: upon block completion • #1: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N), completeBlock(B);

  39. Data-transfer recovery spec • The full spec combines failure events (crash), expectations, actual facts, and the underlying I/O events

  40. More detailed specs • The spec only says: “something is wrong” • Why? Where is the bug? • Let’s write more detailed specs

  41. Adding specs • First analysis • Client’s pipeline excludes Node2, why? • Maybe the client gets a bad ack for Node2: errBadAck(N) :- dataAck(N, “Error”), liveNodes(N); • Second analysis • Client gets a bad ack for Node2, why? • Maybe Node1 could not communicate with Node2: errBadConnect(N, TgtN) :- dataTransfer(N, TgtN, “Terminated”), liveNodes(TgtN); • We catch the bug! • Node2 cannot talk to Node3 (crashed) • Node2 terminates all its connections (including the one to Node1!) • Node1 thinks Node2 is dead

  42. Catch bugs with more specs • More detailed specs • Catch bugs closer to the source and earlier in time • Write specs from the datanode’s view, the client’s view, and the global view

  43. More ... • Design patterns • Add detailed specs • Refine existing specs • Write specs from different views (global, client, dn) • Incorporate diverse failures (crashes, net partitions) • Express different violations (data-loss, unavailability)

  44. Outline • Research background • FATE and DESTINI • Motivation • FATE • DESTINI • Evaluation

  45. Evaluation • Implementation complexity • ~6000 LOC in Java • Target 3 popular cloud systems • HadoopFS (primary target) • Underlying storage for Hadoop/MapReduce • ZooKeeper • Distributed synchronization service • Cassandra • Distributed key-value store • Recovery bugs • Found 22 new HDFS bugs (confirmed) • Data loss, unavailability bugs • Reproduced 51 old bugs

  46. Availability bug • “If multiple racks are available (reachable), a block should be stored in a minimum of two racks” • “Throw a violation if a block is only stored in one rack, but the rack is connected to another rack” • errorSingleRack(B) :- rackCnt(B, Cnt), Cnt == 1, blkRacks(B, R), connected(R, Rb), endOfReplicationMonitor(_); • Derivation: rackCnt(B, 1), blkRacks(B, R1), connected(R1, R2) → errorSingleRack(B) • Scenario: FATE injects a rack partitioning; the Replication Monitor sees #replicas = 3 but does not check the locations, so B is not migrated to Rack #2 • Availability bug!

  47. Pruning Efficiency • Reduce #experiments by an order of magnitude • Each experiment = 4-9 seconds • Found the same number of bugs (by experience) • [Chart: brute-force vs. pruned #experiments for Write and Append workloads with 2 and 3 crashes]

  48. Specification simplicity • Compared to other related work

  49. Summary • Cloud systems • All good, but must manage failures • Performance, reliability, availability depend on failure recovery • FATE and DESTINI • Explore multiple, diverse failures systematically • Facilitate declarative recovery specifications • A unified framework • Real-world adoption in progress

  50. END Research background FATE and DESTINI Thanks! Questions?
