
Rebound: Scalable Checkpointing for Coherent Shared Memory


Presentation Transcript


1. Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu

2. Checkpointing in Shared-Memory MPs
[Diagram: processors P1–P4 periodically save checkpoints; on a fault, execution rolls back to the last checkpoint]
• HW-based schemes for small CMPs use global checkpointing
• All processors participate in system-wide checkpoints
• Global checkpointing is not scalable: synchronization, bursty movement of data, loss in rollback…

3. Alternative: Coordinated Local Checkpointing
[Diagram: local checkpoints taken by small groups of processors (P1–P5) vs. a single global checkpoint]
• Idea: threads coordinate their checkpointing in groups
• Rationale:
  • Faults propagate only through communication
  • Interleaving between non-communicating threads is irrelevant
+ Scalable: checkpoint and rollback in processor groups
– Complexity: must record inter-thread dependences dynamically

4. Contributions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%, compared to 15% for global checkpointing

5. Background: In-Memory Checkpointing with ReVive [Prvulovic-02]
[Diagram: during execution, dirty-line displacements from the caches are written back to memory after the old memory values are logged; at a checkpoint, the application stalls, the registers are dumped, and all dirty cache lines are written back, again logging old values to the memory-resident log]

6. Background: In-Memory Checkpointing with ReVive [Prvulovic-02] (continued)
[Diagram: on a fault, the caches are invalidated, memory lines are reverted from the log, and the old register state is restored]
• Global checkpointing uses a broadcast protocol; coordinated local checkpointing needs a scalable protocol
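The key mechanism here is undo logging: before the memory copy of a line is overwritten by a writeback, its old value is saved in the memory-resident log, so a fault can be undone by replaying the log backwards. A minimal C sketch of that idea follows; the type and function names (log_entry, logged_writeback, revert_memory) are illustrative assumptions, not ReVive's actual interface.

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES   64
    #define LOG_CAPACITY 4096     /* capacity checks omitted for brevity */

    /* One undo-log record: the address of a memory line and its old contents. */
    typedef struct {
        uintptr_t addr;
        uint8_t   old_data[LINE_BYTES];
    } log_entry;

    static log_entry undo_log[LOG_CAPACITY];
    static int       log_tail = 0;    /* entries logged since the last checkpoint */

    /* Write back a dirty cache line, logging the old memory value first. */
    static void logged_writeback(uint8_t *mem_line, const uint8_t *dirty_line) {
        log_entry *e = &undo_log[log_tail++];
        e->addr = (uintptr_t)mem_line;
        memcpy(e->old_data, mem_line, LINE_BYTES);   /* save pre-image */
        memcpy(mem_line, dirty_line, LINE_BYTES);    /* then update memory */
    }

    /* On a fault, revert memory to the last checkpoint by undoing the log
     * in reverse order, then discard the log. */
    static void revert_memory(void) {
        while (log_tail > 0) {
            log_entry *e = &undo_log[--log_tail];
            memcpy((uint8_t *)e->addr, e->old_data, LINE_BYTES);
        }
    }

    /* A committed checkpoint simply truncates the log. */
    static void commit_checkpoint(void) { log_tail = 0; }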

7. Coordinated Local Checkpointing Rules
[Diagram: P1 writes x and P2 later reads x; the consumer cannot checkpoint unless the producer does, and the consumer must roll back if the producer does]
• P checkpoints ⇒ P's producers checkpoint
• P rolls back ⇒ P's consumers roll back
• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]
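Both rules are transitive: if P must checkpoint, so must P's producers and their producers; if P rolls back, so must its consumers and their consumers. A small C sketch of that closure over per-processor dependence bitmaps (the representation and names are illustrative, not Rebound's implementation):

    #include <stdint.h>
    #include <stdio.h>

    #define NPROC 8
    typedef uint8_t procmask_t;          /* bit i set => processor i is in the set */

    /* Per-processor dependence bitmaps for the current interval. */
    static procmask_t my_producers[NPROC];
    static procmask_t my_consumers[NPROC];

    /* Transitive closure of a dependence relation starting from processor p:
     * deps is either my_producers (checkpoint rule) or my_consumers (rollback rule). */
    static procmask_t closure(int p, const procmask_t deps[NPROC]) {
        procmask_t set = (procmask_t)(1u << p);
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < NPROC; i++) {
                if ((set & (1u << i)) && (deps[i] & ~set)) {
                    set |= deps[i];      /* pull in i's producers/consumers */
                    changed = 1;
                }
            }
        }
        return set;
    }

    int main(void) {
        /* Example: P0 consumed data from P1, and P1 consumed data from P2. */
        my_producers[0] = 1u << 1;  my_consumers[1] = 1u << 0;
        my_producers[1] = 1u << 2;  my_consumers[2] = 1u << 1;

        /* Rule 1: P0 checkpoints => its (transitive) producers checkpoint too. */
        printf("checkpoint set of P0: 0x%x\n", closure(0, my_producers)); /* {P0,P1,P2} */
        /* Rule 2: P2 rolls back => its (transitive) consumers roll back too. */
        printf("rollback set of P2:   0x%x\n", closure(2, my_consumers)); /* {P0,P1,P2} */
        return 0;
    }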

8. Rebound Fault Model
[Diagram: chip multiprocessor attached to main memory, which holds the log (maintained in SW)]
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope; fault-detection latency has an upper bound of L cycles

9. Rebound Architecture
[Diagram: chip multiprocessor with per-core P+L1, an L2 cache controller holding the Dep registers (MyProducers, MyConsumers), and a directory cache whose entries carry an LW-ID field; main memory off chip]

10. Rebound Architecture
[Same diagram as the previous slide]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor

11. Rebound Architecture
[Same diagram as the previous slide]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: last writer to the line in the current checkpoint interval
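As a concrete picture of this state, here is a minimal C sketch of the per-node Dep registers and of a directory entry extended with LW-ID, assuming 64 processors; the field names follow the slides but the layout is purely illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define NPROC 64

    /* Dependence (Dep) registers kept at each node's L2 cache controller. */
    typedef struct {
        uint64_t my_producers;   /* bit p set: proc p produced data this node consumed */
        uint64_t my_consumers;   /* bit p set: proc p consumed data this node produced */
    } dep_registers;

    /* One directory entry, extended with the last-writer ID for the current interval. */
    typedef struct {
        uint64_t sharers;        /* bit vector of sharers (as in a plain directory) */
        uint8_t  state;          /* e.g., MESI state of the line */
        uint8_t  lw_id;          /* last writer in the current checkpoint interval */
        bool     lw_valid;       /* LW-ID may be lazily invalidated (see slide 16) */
    } directory_entry;

    /* At a checkpoint, the local Dep registers are simply cleared. */
    static inline void clear_dep_registers(dep_registers *d) {
        d->my_producers = 0;
        d->my_consumers = 0;
    }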

12. Recording Inter-Thread Dependences (MESI protocol assumed)
[Diagram: P1 writes a line; the directory marks the line Dirty and sets LW-ID = P1]

13. Recording Inter-Thread Dependences (continued)
[Diagram: P2 reads the line; the directory sees LW-ID = P1, so P2's MyProducers gets P1 and P1's MyConsumers gets P2; the line is written back (its old value logged) and transitions from Dirty to Shared]

14. Recording Inter-Thread Dependences (continued)
[Diagram: P1 writes the line again; the line transitions from Shared back to Dirty and LW-ID remains P1; the Dep registers already hold the P1→P2 dependence]

15. Recording Inter-Thread Dependences (continued)
[Diagram: P1 checkpoints; its dirty lines are written back with their old values logged, its Dep registers are cleared, and its LW-IDs are cleared; an LW-ID must remain set until the line is checkpointed]
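Slides 12–15 can be summarized as directory-side actions on writes, reads, and checkpoints. The C sketch below models them sequentially (no real coherence messages); dep_regs[], directory_entry, and the helper names are assumptions made only for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define NPROC 64

    typedef struct { uint64_t my_producers, my_consumers; } dep_registers;
    typedef struct { uint8_t lw_id; bool lw_valid; } directory_entry;

    static dep_registers dep_regs[NPROC];

    /* A processor writes a line: the directory records it as the last writer
     * of the line for the current checkpoint interval. */
    static void on_write(directory_entry *e, int writer) {
        e->lw_id = (uint8_t)writer;
        e->lw_valid = true;
    }

    /* A processor reads a line: if another processor wrote it in the current
     * interval, record the producer->consumer dependence in both Dep registers.
     * If the line was dirty in the writer's cache, it is also written back and
     * its old value logged (slide 13). */
    static void on_read(directory_entry *e, int reader) {
        if (e->lw_valid && e->lw_id != reader) {
            dep_regs[reader].my_producers   |= 1ull << e->lw_id;
            dep_regs[e->lw_id].my_consumers |= 1ull << reader;
        }
    }

    /* When a processor checkpoints, its Dep registers are cleared; its LW-IDs in
     * the directory are cleared too -- eagerly here, lazily in Rebound (slide 16). */
    static void on_checkpoint(int proc, directory_entry *dir, int nlines) {
        dep_regs[proc].my_producers = dep_regs[proc].my_consumers = 0;
        for (int i = 0; i < nlines; i++)
            if (dir[i].lw_valid && dir[i].lw_id == proc)
                dir[i].lw_valid = false;
    }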

16. Lazily Clearing Last Writers
• Clearing all LW-IDs in the directory at every checkpoint is an expensive process
• Write Signature: encodes all line addresses that the processor has written to (or read exclusively) in the current interval
• At a checkpoint, the processors simply clear their Write Signatures
• As a result, a directory LW-ID may be potentially stale

17. Lazily Clearing Last Writers (continued)
[Diagram: P2 reads a line whose directory LW-ID points to P1; the line address is checked against P1's Write Signature (WSig); it misses (NO!), so the LW-ID is stale and is cleared instead of recording a dependence]
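The Write Signature can be thought of as a small Bloom-filter-like register: line addresses are hashed into a bit vector on every write (or exclusive read), membership tests may give false positives but never false negatives, and clearing it at a checkpoint is a single register reset. A possible C sketch, with hash functions and sizes chosen arbitrarily for illustration:

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define SIG_BITS 1024                      /* signature size: illustrative only */

    typedef struct { uint64_t bits[SIG_BITS / 64]; } write_signature;

    /* Two simple hash functions over the line address (assumptions, not Rebound's). */
    static unsigned h1(uintptr_t a) { return (unsigned)(((a >> 6) * 0x9E3779B97F4A7C15ull) >> 54) % SIG_BITS; }
    static unsigned h2(uintptr_t a) { return (unsigned)(((a >> 6) * 0xC2B2AE3D27D4EB4Full) >> 54) % SIG_BITS; }

    /* Record a line the processor has written (or read exclusively). */
    static void sig_insert(write_signature *s, uintptr_t line_addr) {
        s->bits[h1(line_addr) / 64] |= 1ull << (h1(line_addr) % 64);
        s->bits[h2(line_addr) / 64] |= 1ull << (h2(line_addr) % 64);
    }

    /* May return a false positive, never a false negative. */
    static bool sig_maybe_contains(const write_signature *s, uintptr_t line_addr) {
        return ((s->bits[h1(line_addr) / 64] >> (h1(line_addr) % 64)) & 1) &&
               ((s->bits[h2(line_addr) / 64] >> (h2(line_addr) % 64)) & 1);
    }

    /* At a checkpoint the processor just clears its signature; stale LW-IDs in the
     * directory are then detected on the next access, as in slide 17: if the last
     * writer's signature does not contain the address, the LW-ID is cleared and
     * no dependence is recorded. */
    static void sig_clear(write_signature *s) { memset(s->bits, 0, sizeof s->bits); }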

18. Distributed Checkpointing Protocol in SW
[Diagram: P1 initiates a checkpoint; Interaction Set so far: {P1}]
• Interaction Set [Pi]: set of producer processors (transitively) for Pi
• Built using MyProducers

19. Distributed Checkpointing Protocol in SW (continued)
[Diagram: P1 sends Ck? requests to its producers P2 and P3]

20. Distributed Checkpointing Protocol in SW (continued)
[Diagram: P2 and P3 reply Accept, and P3 forwards a Ck? request to its own producer P4; Interaction Set: {P1, P2, P3}]

21. Distributed Checkpointing Protocol in SW (continued)
[Diagram: P4 replies Decline; acknowledgements flow back toward the initiator; Interaction Set remains {P1, P2, P3}]

22. Distributed Checkpointing Protocol in SW (continued)
[Diagram: final state of the exchange; Interaction Set: {P1, P2, P3}]
• Checkpointing is a two-phase commit protocol
• Interaction Set [Pi]: set of producer processors (transitively) for Pi, built using MyProducers
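The exchange on slides 18–22 is a two-phase protocol: Ck? requests go out to candidate producers (found transitively through MyProducers), Accept/Decline replies determine the interaction set, and once all replies are acknowledged the whole set commits together. The C sketch below is a heavily simplified, sequential stand-in for that software protocol; the handlers and the hard-coded Decline by P4 are only meant to mirror the slides' example, not Rebound's exact policy.

    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 8

    /* Hypothetical per-processor handlers standing in for Rebound's SW routines. */
    static bool handle_ck_request(int p) {          /* "Ck?" -> Accept / Decline */
        bool accept = (p != 4);                     /* e.g., P4 declines, as in the slides */
        printf("P%d: %s\n", p, accept ? "Accept" : "Decline");
        return accept;
    }
    static void commit_checkpoint(int p) { printf("P%d: commit\n", p); }

    /* Two-phase protocol over a candidate set (bit p set => processor p was asked).
     * Phase 1: send Ck? to every candidate; the accepters form the interaction set.
     * Phase 2: once all replies are in, the interaction set commits together. */
    static void checkpoint_group(unsigned candidates) {
        unsigned interaction_set = 0;

        for (int p = 0; p < NPROC; p++)             /* phase 1: Ck? / Accept / Decline */
            if ((candidates & (1u << p)) && handle_ck_request(p))
                interaction_set |= 1u << p;

        for (int p = 0; p < NPROC; p++)             /* phase 2: group commit */
            if (interaction_set & (1u << p))
                commit_checkpoint(p);
    }

    int main(void) {
        /* Candidates reached transitively through MyProducers: P1..P4 in the slides. */
        checkpoint_group((1u << 1) | (1u << 2) | (1u << 3) | (1u << 4));
        return 0;
    }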

23. Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol: the interaction set is built transitively using MyConsumers
• Rollback involves:
  • Clearing the Dep registers and the Write Signature
  • Invalidating the processor caches
  • Restoring the data and register context from the logs up to the latest checkpoint
• No domino effect
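A matching sketch of the rollback side: the interaction set is built through MyConsumers (as in the closure sketch after slide 7), and each member then performs the three steps listed above. The helper names are hypothetical stand-ins for hardware and software-handler actions.

    #include <stdio.h>

    #define NPROC 8

    /* Illustrative stand-ins for Rebound's actual rollback actions. */
    static void clear_dep_and_signature(int p) { printf("P%d: clear Dep regs + Write Signature\n", p); }
    static void invalidate_caches(int p)       { printf("P%d: invalidate caches\n", p); }
    static void restore_from_log(int p)        { printf("P%d: revert memory + restore registers from log\n", p); }

    /* Roll back an interaction set that was built transitively through MyConsumers. */
    static void rollback_group(unsigned interaction_set) {
        for (int p = 0; p < NPROC; p++)
            if (interaction_set & (1u << p)) {
                clear_dep_and_signature(p);
                invalidate_caches(p);
                restore_from_log(p);
            }
    }

    int main(void) {
        rollback_group((1u << 0) | (1u << 2));   /* e.g., faulty P0 and its consumer P2 */
        return 0;
    }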

24. Optimization 1: Delayed Writebacks
[Timeline: the baseline stalls at the checkpoint to write back dirty lines before starting interval I2; with delayed writebacks, the processors synchronize and start I2 while the writebacks proceed in the background]
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution
  • Hardware automatically writes back dirty lines in the background
  • The checkpoint completes only when all delayed data has been written back
  • Inter-thread dependences on delayed data must still be recorded
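One way to picture the mechanism: at the checkpoint the processor only marks its dirty lines as delayed and resumes, a background drainer writes them back (logging old values as usual), and the checkpoint completes once the count of pending delayed lines reaches zero. The following C sketch is a sequential approximation of that idea; all names are assumptions.

    #include <stdbool.h>
    #include <stdio.h>

    #define NLINES 16

    typedef struct {
        bool dirty;
        bool delayed;        /* marked at the checkpoint, drained in the background */
    } cache_line;

    static cache_line cache[NLINES];
    static int pending_delayed = 0;     /* gates checkpoint completion */

    /* At the checkpoint: mark dirty lines as delayed and resume execution
     * immediately instead of stalling for the writebacks. */
    static void start_delayed_checkpoint(void) {
        for (int i = 0; i < NLINES; i++)
            if (cache[i].dirty) {
                cache[i].delayed = true;
                pending_delayed++;
            }
        printf("checkpoint started, %d delayed writebacks pending\n", pending_delayed);
    }

    /* Background hardware drains one delayed line per call (old value logged,
     * as in ReVive); the checkpoint completes when none remain. */
    static void drain_one_delayed_line(void) {
        for (int i = 0; i < NLINES; i++)
            if (cache[i].delayed) {
                cache[i].delayed = cache[i].dirty = false;   /* write back + log old value */
                if (--pending_delayed == 0)
                    printf("all delayed data written back: checkpoint complete\n");
                return;
            }
    }

    int main(void) {
        cache[1].dirty = cache[5].dirty = true;
        start_delayed_checkpoint();
        drain_one_delayed_line();
        drain_one_delayed_line();           /* second drain completes the checkpoint */
        return 0;
    }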

25. Delayed Writebacks: Pros/Cons
+ Significant reduction in checkpoint overhead
– Additional support:
  • Each processor has two sets of Dep registers and Write Signatures
  • Each cache line has a delayed bit
– Increased vulnerability: a rollback event forces both intervals to roll back

26. Delayed Writeback Protocol
[Diagram: P2 reads a line whose LW-ID points to P1; the address is checked against both of P1's Write Signatures to determine whether the dependence belongs to the still-draining previous interval (hit in WSig0) or to the current interval (WSig1); the corresponding set of Dep registers (MyProducers/MyConsumers 0 or 1) is updated, the line is written back, and its old value is logged]

27. Optimization 2: Multiple Checkpoints
[Timeline: a fault at time tf is detected only after the detection latency, past Ckpt 2; the rollback must therefore go back to Ckpt 1, so each outstanding checkpoint keeps its own set of Dep registers]
• Problem: fault detection is not instantaneous; a checkpoint is safe only after the maximum fault-detection latency (L)
• Solution: keep multiple checkpoints
• On a fault, roll back the interacting processors to safe checkpoints
• No domino effect
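The bookkeeping this implies: a checkpoint taken at time t becomes safe, and its Dep registers can be recycled, only once more than L cycles have passed with no fault reported. A small C sketch of that rule, with an arbitrary window of outstanding checkpoints and illustrative names:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_OUTSTANDING   4        /* outstanding (not-yet-safe) checkpoints */
    #define DETECTION_LATENCY 1000     /* L: upper bound on fault-detection latency (cycles) */

    typedef struct {
        uint64_t taken_at;             /* cycle at which the checkpoint was taken */
        uint64_t my_producers;         /* this checkpoint's own Dep registers */
        uint64_t my_consumers;
        bool     in_use;
    } checkpoint_slot;

    static checkpoint_slot slots[MAX_OUTSTANDING];

    /* A slot (and its Dep registers) can be recycled only once the checkpoint is
     * older than the worst-case detection latency: any fault that could force a
     * rollback past it would already have been detected. */
    static void recycle_safe_checkpoints(uint64_t now) {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
            if (slots[i].in_use && now - slots[i].taken_at > DETECTION_LATENCY) {
                slots[i].in_use = false;
                slots[i].my_producers = slots[i].my_consumers = 0;
                printf("checkpoint in slot %d is now safe; Dep registers recycled\n", i);
            }
    }

    /* Taking a new checkpoint claims a free slot (the processor must wait if none is free). */
    static int take_checkpoint(uint64_t now) {
        for (int i = 0; i < MAX_OUTSTANDING; i++)
            if (!slots[i].in_use) {
                slots[i].in_use = true;
                slots[i].taken_at = now;
                return i;
            }
        return -1;                     /* all slots busy: wait for one to become safe */
    }

    int main(void) {
        take_checkpoint(0);
        recycle_safe_checkpoints(500);     /* still within L: nothing recycled   */
        recycle_safe_checkpoints(2000);    /* past L with no fault: slot recycled */
        return 0;
    }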

28. Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
– Additional support:
  • Each checkpoint has its own Dep registers
  • Dep registers can be recycled only after the fault-detection latency
  • Communication must be tracked across checkpoints
  • Combined with Delayed Writebacks: one more Dep register set

29. Optimization 3: Hiding Checkpoints behind Global Barriers
• Global barriers require that all processors communicate, which leads to global checkpoints
• Optimization:
  • Proactively trigger a global checkpoint at a global barrier
  • Hide the checkpoint overhead behind the barrier's imbalance spins

30. Hiding the Checkpoint behind a Global Barrier
Barrier code (simplified):
    Lock
      count++
      if (count == numProc)
        I_am_last = TRUE   /* local var */
    Unlock
    if (I_am_last) {
      count = 0
      flag = TRUE
      ...
    } else
      while (!flag) {}

31. Hiding the Checkpoint behind a Global Barrier (continued)
[Diagram: processors P1, P2, and P3 execute the barrier code above; the first arrival initiates a barrier checkpoint (BarCK?), later arrivals send Notify and spin on the flag while the checkpointing proceeds; each processor's interaction set (ICHK) stays small]
• The first arriving processor initiates the checkpoint
• Others: hardware writes back data as execution proceeds to the barrier
• The checkpoint commits as the last processor arrives
• After the barrier: few interacting processors
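A pthreads rendition of the idea, folding the checkpoint into the barrier above: the first thread to arrive triggers the checkpoint, the others spin while the hardware writes back their data, and the last arrival commits before releasing the flag. The trigger/commit hooks are hypothetical placeholders for Rebound's actual machinery, and the barrier is single-use for simplicity.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PROC 4

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int count = 0;
    static volatile bool flag = false;

    /* Hypothetical hooks standing in for Rebound's checkpoint machinery. */
    static void trigger_barrier_checkpoint(int id) { printf("P%d: first at barrier, trigger checkpoint (BarCK?)\n", id); }
    static void writeback_in_background(int id)    { printf("P%d: HW writes back dirty data while spinning\n", id); }
    static void commit_barrier_checkpoint(void)    { printf("last arrival: commit the global checkpoint\n"); }

    static void *worker(void *arg) {
        int id = (int)(long)arg;
        bool i_am_first = false, i_am_last = false;

        pthread_mutex_lock(&lock);
        if (count == 0) i_am_first = true;           /* first arrival starts the checkpoint  */
        count++;
        if (count == NUM_PROC) i_am_last = true;     /* last arrival will commit and release */
        pthread_mutex_unlock(&lock);

        if (i_am_first) trigger_barrier_checkpoint(id);
        writeback_in_background(id);

        if (i_am_last) {
            commit_barrier_checkpoint();             /* checkpoint hidden behind the barrier */
            count = 0;
            flag = true;                             /* release the spinners */
        } else {
            while (!flag) { /* spin; delayed write-backs drain meanwhile */ }
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NUM_PROC];
        for (long i = 0; i < NUM_PROC; i++) pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_PROC; i++)  pthread_join(t[i], NULL);
        return 0;
    }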

32. Evaluation Setup
• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without the delayed writebacks

33. Avg. Interaction Set: Set of Producer Processors
[Chart: average interaction-set size per application, for up to 64 processors]
• For most applications the interaction set is small, which justifies coordinated local checkpointing
• The averages are brought up by global barriers

34. Checkpoint Execution Overhead
[Chart: checkpoint execution overhead per application]
• Rebound's average checkpoint execution overhead is 2%, compared to 15% for Global

35. Checkpoint Execution Overhead (continued)
• Rebound's average checkpoint execution overhead is 2%, compared to 15% for Global
• Delayed writebacks complement local checkpointing

36. Rebound Scalability
[Chart: checkpoint overhead vs. processor count, constant problem size]
• Rebound is scalable in checkpoint overhead
• Delayed writebacks help scalability

37. Also in the Paper
• Delayed writebacks are also useful in Global checkpointing
• The barrier optimization is effective but not universally applicable
• Power increase due to the hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic

38. Conclusions
Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared-memory machines
• Leverages the directory protocol
• Boosts checkpointing efficiency:
  • Delayed writebacks
  • Multiple checkpoints
  • Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories

39. Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
