
a shared log design for flash clusters


Presentation Transcript


  1. a shared log design for flash clusters Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Michael Wei. Microsoft Research Silicon Valley

  2. tape is dead, disk is tape, flash is disk, RAM locality is king - Jim Gray, Dec 2006

  3. flash in the data center: can flash clusters eliminate the trade-off between consistency and performance? what new abstractions are required to manage and access flash clusters?

  4. the CORFU abstraction: a shared log. applications call append(value) and read(offset) through the CORFU library: append to the tail, read from anywhere in the flash cluster (diagram throughput labels: 20K/s, 200K/s, 500K/s). example application: the Hyder database (Bernstein et al., CIDR 2011). infrastructure applications: SMR databases, key-value stores, filesystems, virtual disks

  5. the CORFU hardware: network flash. network-attached flash units: low power (15W per unit), low latency, low cost. [table: cost and power usage of a 1 TB, 10 Gbps flash farm]

  6. problem statement: how do we implement a scalable shared log over a cluster of network-attached flash units?

  7. the CORFU design. CORFU API: V = read(O), O = append(V), trim(O). the mapping resides at the client, in the CORFU library: each logical entry (4KB) is mapped to a replica set of physical flash pages. clients read from anywhere and append to the tail
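
A minimal sketch of the client-side API on this slide. CorfuClient and its in-memory stand-in for the cluster are assumed names for illustration, not the actual library; a real client writes 4KB entries to replicated flash pages.

    # Illustrative sketch of the CORFU client API (assumed names).
    class CorfuClient:
        def __init__(self, projection):
            # the mapping from logical offsets to replica sets of
            # physical flash pages resides here, at the client
            self.projection = projection
            self.cluster = {}   # stand-in for the flash cluster in this sketch
            self.tail = 0

        def read(self, offset):          # V = read(O)
            return self.cluster[offset]

        def append(self, value):         # O = append(V)
            offset = self.tail
            self.cluster[offset] = value
            self.tail += 1
            return offset

        def trim(self, offset):          # trim(O)
            self.cluster.pop(offset, None)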

  8. the CORFU protocol: reads. the application calls read(pos); the client-side CORFU library consults its Projection (D1 D2 | D3 D4 | D5 D6 | D7 D8) to map the position to a replica chain, then issues read(D1/D2, page#) against the CORFU cluster
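
A sketch of that read path, assuming log positions are striped round-robin across the replica chains shown in the projection; the data layout here is an illustrative assumption, not the paper's wire format.

    # Sketch of the client-side read path (assumed striping scheme).
    def corfu_read(projection, units, pos):
        # projection: list of chains, e.g. [("D1","D2"), ("D3","D4"), ...]
        # units: dict mapping unit name -> dict of page# -> value
        chain = projection[pos % len(projection)]   # chain for this position
        page = pos // len(projection)               # page number on that chain
        return units[chain[-1]][page]               # read at the chain's tail

    # example: with four chains, read(pos=5) maps to chain D3/D4, page 1
    projection = [("D1", "D2"), ("D3", "D4"), ("D5", "D6"), ("D7", "D8")]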

  9. the CORFU protocol: appends. to append(val), the client first reserves the next position in the log (e.g., 100) from the sequencer (T0), then writes the value to that position's chain: write(D1/D2, val). CORFU append throughput: # of 64-bit tokens issued per second. the sequencer is only an optimization! clients can probe for the tail or reconstruct it from the flash units
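
A sketch of the append path under those rules: reserve a token, then write down the chain. The sequencer is just a counter here, matching the slide's point that it only hands out 64-bit tokens; names are assumed.

    # Sketch of the append path (assumed names).
    import itertools

    class Sequencer:
        def __init__(self, start=0):
            self._next = itertools.count(start)

        def next_token(self):
            return next(self._next)    # append throughput = tokens issued/sec

    def corfu_append(sequencer, write_chain, value):
        pos = sequencer.next_token()   # reserve the next position in the log
        write_chain(pos, value)        # e.g. write(D1/D2, val), see next slide
        return pos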

  10. chain replication in CORFU. safety under contention: if multiple clients try to write to the same log position concurrently, only one wins; writes to already-written pages => error. durability: data is only visible to reads if the entire chain has seen it; reads on unwritten pages => error. requires 'write-once' semantics from the flash unit
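
A sketch of write-once chain replication as just described: contending writers conflict at the head of the chain, so only one wins, and an entry becomes readable only once the whole chain holds it. Class names and error strings are illustrative.

    # Sketch of write-once chained writes (assumed names).
    class WriteOnceUnit:
        def __init__(self):
            self.pages = {}

        def write(self, page, value):
            if page in self.pages:
                raise ValueError("error: page already written")  # write-once
            self.pages[page] = value

        def read(self, page):
            if page not in self.pages:
                raise KeyError("error: page unwritten")
            return self.pages[page]

    def chain_write(chain, page, value):
        for unit in chain:           # head to tail; losers fail at the head
            unit.write(page, value)

    def chain_read(chain, page):
        return chain[-1].read(page)  # tail read: visible only if fully replicated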

  11. handling failures: flash units. each Projection is a list of views mapping offset ranges to flash units. example: Projection 0 maps [0, ∞) to D1 D2 D3 D4 D5 D6 D7 D8; after D2 fails, Projection 1 keeps [0, 7] on the old set (with D2 marked failed) and maps [8, ∞) to D1 D9 D3 D4 D5 D6 D7 D8; Projection 2 later maps [9, ∞) to a fresh set D10 D11 D12 D13 D14 D15 D16 D17. reconfiguration steps: 'seal' the current projection at the flash units, then write the new projection at an auxiliary. latency for a 32-drive cluster: tens of milliseconds
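
A sketch of those two reconfiguration steps, with assumed interfaces for the flash units and the auxiliary: sealing cuts off clients still using the old projection, then the new projection (list of views) is written.

    # Sketch of reconfiguration (assumed unit/auxiliary interfaces).
    class SealableUnit:
        def __init__(self):
            self.sealed = set()

        def seal(self, epoch):
            # unit thereafter rejects requests tagged with this epoch,
            # fencing clients that still hold the old projection
            self.sealed.add(epoch)

    class Auxiliary:
        def __init__(self):
            self.projections = {}          # epoch -> list of views

        def install(self, epoch, views):
            self.projections[epoch] = views

    def reconfigure(units, auxiliary, cur_epoch, new_views):
        for u in units:
            u.seal(cur_epoch)              # step 1: 'seal' current projection
        auxiliary.install(cur_epoch + 1, new_views)   # step 2: write new one
        return cur_epoch + 1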

  12. handling failures: clients • a client obtains a token from the sequencer and crashes: holes in the log • solution: other clients can fill the hole • the fast CORFU fill operation (<1ms) 'walks the chain': completes half-written entries, writes junk on unwritten entries (a metadata operation; conserves flash cycles and bandwidth)
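
A sketch of that fill operation, assuming flash units with the write-once write/read interface plus a has(page) probe; JUNK is an illustrative marker. If the head of the chain holds data, the half-written entry is completed; otherwise junk seals the hole.

    # Sketch of hole filling (assumed unit interface and JUNK marker).
    JUNK = "junk"   # metadata-only marker: no 4KB payload, conserves flash cycles

    def fill(chain, page):
        head = chain[0]
        if head.has(page):
            value = head.read(page)     # half-written entry: complete it
        else:
            value = JUNK                # nothing written: junk the position
            head.write(page, value)
        for unit in chain[1:]:          # 'walk the chain' to the tail
            if not unit.has(page):
                unit.write(page, value)
        return value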

  13. garbage collection: two models • prefix trim(O): invalidate all entries before offset O • entry trim(O): invalidate only the entry at offset O. [diagram: under prefix trim, a contiguous run of invalid entries precedes the valid ones; under entry trim, invalid entries are scattered among valid ones]
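
A sketch contrasting the two models: prefix trim needs only a single watermark per log, while entry trim must track each invalidated offset individually. Class names are illustrative.

    # Sketch of the two trim models (assumed names).
    class PrefixTrimLog:
        def __init__(self):
            self.mark = 0                  # all offsets below this are invalid

        def trim(self, offset):
            self.mark = max(self.mark, offset)

        def is_valid(self, offset):
            return offset >= self.mark

    class EntryTrimLog:
        def __init__(self):
            self.trimmed = set()           # individually invalidated offsets

        def trim(self, offset):
            self.trimmed.add(offset)

        def is_valid(self, offset):
            return offset not in self.trimmed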

  14. CORFU throughput. [plot: reads scale linearly with the number of flash units; appends hit the sequencer bottleneck]

  15. how far is CORFU from Paxos? Paxos-like protocols are IO-bound at the leader… and so is a single CORFU chain. but a Projection 'stitches' together multiple chains: no I/O bottleneck!

  16. conclusion. CORFU is a scalable shared log: linearly scalable reads, 1M appends/s. CORFU uses network-attached flash to construct inexpensive, power-efficient clusters
