
Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification

Alok Garg, M. W. Rashid, and Michael Huang. Department of Electrical & Computer Engineering, University of Rochester.


Presentation Transcript


  1. Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification
     Alok Garg, M. W. Rashid, and Michael Huang
     Department of Electrical & Computer Engineering, University of Rochester

  2. Motivation
     • Out-of-order execution needs efficient memory dependence enforcement logic
     • Conventional approach – complex, hard to scale
     • Tightly coupled forwarding and enforcement
     • We use two decoupled components to simplify the task
     • Opportunistic forwarding using L0 cache
     • Verification against in-order re-execution
     • Slackened memory dependence enforcement (SMDE)
     "Slackened Memory Dependence Enforcement", Alok Garg, ISCA 2006

  3. LSQ: complex & hard to scale
     • Needs priority CAMs
     • Forwarding from LSQ on timing critical path
     • Serialized with address translation
     • Design further complicated by:
     • Coherence and consistency considerations
     • Corner cases: e.g., partial overlap of operands

  4. Highlights of prior work
     • Two-level load store queue [sethumadhavan03], [akkary03], [baugh04], [roth04], [torres05], [gandhi05]
     • Reducing search frequency using clever filtering and prediction mechanisms [park03], [sethumadhavan03]
     • Memory dependence prediction [moshovos.isca97], [moshovos.micro97], [sha05], [stone05]
     • Value-based re-execution [cain04], [roth04], [sha05]
     (more detailed contrast in paper)

  5. Outline
     • Overview of SMDE
     • Optional performance optimizations
     • Evaluation
     • Conclusion

  6. Overview of SMDE

  7. Decoupled execution
     • LSQ: competing requirements
     • Front-end execution: little mem dependence enforcement
     • Back-end execution: detect violations (mem access only)
     • Memory B/W: naturally handled
     [Figure labels: LSQ; Fetch/Decode/Dispatch; Execution (out-of-order); Commit; Front-end execution; Back-end execution; MUX; L0; L1; Memory Hierarchy]

  8. Why it works – two perspectives
     • Back-end execution is the only one required
     • Totally in-order, preserving dependence
     • Any front-end execution is OK
     • L0 is effectively a slow but accurate value predictor
     • Front-end execution is correct most of the time
     • Common case: 99% of loads happen at the right time
     • Speculation is on the timing of load-store pairs
     • Two-level LSQs speculate on the scope of stores
     • Relatively expensive replays are OK
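The two perspectives on slide 8 can be illustrated with a toy model. The sketch below is not the authors' simulator: the names `MemOp`, `front_end`, and `back_end` are hypothetical, memory is modeled as a plain dictionary that reads 0 for untouched addresses, and a replay is reduced to reporting the index of the first load whose front-end value fails verification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemOp:
    kind: str                    # "LD" or "ST"
    addr: int
    value: Optional[int] = None  # store data, or the value a load saw at the front end


def front_end(ops, issue_order, l0):
    """Execute ops in an arbitrary issue order against the speculative L0 (assumed model)."""
    for i in issue_order:
        op = ops[i]
        if op.kind == "ST":
            l0[op.addr] = op.value
        else:
            op.value = l0.get(op.addr, 0)  # opportunistic: whatever L0 holds right now


def back_end(ops, l1):
    """Re-execute in program order against L1.

    Returns the index of the first load whose front-end value does not match
    the in-order value (a replay point), or None if every load verifies.
    """
    for i, op in enumerate(ops):
        if op.kind == "ST":
            l1[op.addr] = op.value
        elif l1.get(op.addr, 0) != op.value:
            return i
    return None
```

For example, with the program `[MemOp("ST", 0x10, 7), MemOp("LD", 0x10)]`, issuing the load before the store at the front end (`issue_order=[1, 0]`) makes the load read a stale 0, and `back_end` flags index 1 for replay. Issuing in program order verifies cleanly. Correctness depends only on the in-order pass, which is why the front end is free to be approximate.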

  9. Advantage – simplicity
     • No priority CAM
     • Decoupled design – flexible, modular
     • Front end – large degree of freedom
     • No need for address translation
     • Soft errors can be ignored (ECC not needed)
     • Corner cases – handle partial overlaps naturally
     • Can ignore coherence invalidations

  10. Performance of naïve design
     [Figure labels: LQ: 64; SQ: 48]

  11. Optional performance optimizations

  12. Reducing replay frequency
     • Major replay cause – RAW violations
     • 48% replays due to RAW violation
     • Replays indirectly cause more replays
     • Often address available (data is not)
     • Fuzzy disambiguation queue (FDQ)
     • Reject known premature loads
     • Best effort enough, no need to guarantee anything
     • Conventional LSQ handles this (e.g., POWER 4)

  13. FDQ: How it works
     [Figure: ROB entries 1 to 6, ordered old to new, containing ST and LD instructions; Fuzzy Disambiguation Queue entries with Address and AGE fields]

  14. FDQ not complex
     • Very different from conventional SQ
     • Does not have priority logic
     • No need to merge with cache data path
     • Small queue is sufficient – no scalability pressure
     • Stores do not stay in FDQ for the entire lifetime
     • Flexible replacement
     • A “local” technique
     • Only support needed load rejection
     • No need to augment issue logic to enforce predicted dependence
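As a rough illustration of the FDQ described on slides 12 to 14, the sketch below tracks stores whose address is known but which have not yet executed, and rejects a younger load to the same address. The class and method names are hypothetical, ages are assumed to be integers where smaller means older, and the eviction policy (drop the earliest-inserted entry) is an arbitrary choice, since the slides only say replacement is flexible.

```python
from collections import OrderedDict


class FuzzyDisambiguationQueue:
    """Best-effort filter for premature loads; a hypothetical sketch, not the paper's design."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # store age -> store address

    def store_address_ready(self, age, addr):
        # Evicting is always safe: the back end still catches any violation.
        if age not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[age] = addr

    def store_executed(self, age):
        # Stores leave early; they do not stay for their entire lifetime.
        self.entries.pop(age, None)

    def should_reject(self, load_age, load_addr):
        # Plain match on address and relative age; no priority selection is needed
        # because the queue only rejects, it never forwards data.
        return any(age < load_age and addr == load_addr
                   for age, addr in self.entries.items())
```

A rejected load would simply be retried later. Because the queue makes no guarantees, a missing or evicted entry costs at most a replay, never correctness.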

  15. Write buffer at the back-end
     • Temporarily holds not yet committed stores
     • Allow back-end execution of loads and stores to start early
     • A few entries sufficient to streamline back-end execution
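One way to picture the back-end write buffer is as a small FIFO in front of L1. The sketch below is an assumption-laden model rather than the design from the paper: `BackEndWriteBuffer` and its methods are hypothetical names, L1 is a dictionary that reads 0 for untouched addresses, and a verifying load is assumed to take the youngest matching buffered store before falling back to L1.

```python
from collections import deque


class BackEndWriteBuffer:
    """Holds stores that have executed at the back end but are not yet committed (sketch)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.pending = deque()  # (addr, value) pairs, oldest first

    def push_store(self, addr, value):
        """Buffer a store; returns False when full so the caller can stall."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((addr, value))
        return True

    def load(self, addr, l1):
        """Value an in-order load should see: youngest buffered store, else L1."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return l1.get(addr, 0)

    def commit_oldest(self, l1):
        """Drain the oldest buffered store into L1 when it commits."""
        if self.pending:
            a, v = self.pending.popleft()
            l1[a] = v
```

The default of 8 entries mirrors the 8-entry configuration named in the evaluation slides. The point of the buffer is that a load's verification no longer has to wait for every older store to reach L1.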

  16. Evaluation

  17. Evaluation environment
     • Simulator strives to model SMDE very faithfully
     • Load speculation, load rejection, and store-load replay
     • Data value in the caches
     • Scheduling replays
     • Do not allocate load queue entry for pre-fetches
     • SPEC CPU2000 benchmark suite
     • System configuration:
     • ROB/Register (INT, FP) – 512/(400, 400)
     • LSQ (LQ, SQ) – 112 (64, 48)
     • L0 speculative cache – 16KB, 2-way, 1 cycle

  18. Impact of 8-entry write buffer

  19. Replay frequency reduction
     [Figure panels: (a) Integer applications. (b) Floating-point applications.]

  20. Replay breakdown

  21. Performance improvement

  22. Scalability test
     • Memory dependence logic unchanged
     • ROB, RFs, IQs doubled

  23. Other details in paper
     • Scope of replay
     • Detailed study on replay causes
     • Replay suppression technique
     • Age based filtering
     • Discussion on L0 flush policy
     • Understanding write buffer
     • Membership test for write buffer
     * “Implementation Issues of Slackened Memory Dependence Enforcement”, A. Garg, M. Rashid, and M. Huang, Technical Report.

  24. Conclusions
     • Common-case forwarding and correctness guarantee separately handled
     • Decoupled execution allows modular design, verification, and optimization
     • Forwarding logic is simple to design and incurs minimal interference on execution
     • Scales very well
     • Can achieve close to ideal performance

  25. Slackened Memory Dependence Enforcement: Combining Opportunistic Forwarding with Decoupled Verification
     Alok Garg, M. W. Rashid, and Michael Huang
     Department of Electrical & Computer Engineering, University of Rochester
     Link to technical report: http://www.ece.rochester.edu/~garg/documents/isca06tr.pdf

  26. Streamlining back-end execution
     [Figure: timing diagram over cycles 1 to 7, age ordered old to new; labels: Verification, commit, ST, Bubble, LD, ROB, Reload]

  27. Streamlining back-end execution
     • Insert write buffer at the commit stage
     [Figure: timing diagram over cycles 1 to 7, age ordered old to new; labels: ST, WB, CT, LD, RL, ROB]
