
Managing Wire Delay in CMP Caches


Presentation Transcript


  1. Managing Wire Delay in CMP Caches
     Brad Beckmann, Dissertation Defense
     Multifacet Project, http://www.cs.wisc.edu/multifacet/
     University of Wisconsin-Madison, 8/15/06

  2. Current CMP: AMD Athlon 64 X2
     [Diagram: 2 CPUs (CPU 0, CPU 1) and 2 L2 cache banks]

  3. CMP Cache Trends
     [Diagram: today's technology (~90 nm) vs. future technology (< 45 nm) — an 8-CPU CMP in which each CPU has private L1 I and D caches and shares 8 L2 banks]

  4. Baseline: CMP-Shared
     [Diagram: 8 CPUs with private L1 I/D caches sharing 8 L2 banks]
     • Maximizes cache capacity
     • Slow access latency: 40+ cycles

  5. Baseline: CMP-Private
     [Diagram: 8 CPUs, each with private L1 I/D caches and a private L2]
     • Fast access latency
     • Lower effective capacity
     Thesis: achieve both fast access & high capacity

  6. Thesis Contributions
     • #1 Characterizing CMP workloads — sharing types
       • Single requestor
       • Shared read-only
       • Shared read-write
     • Techniques to manage wire delay
       • #2 Migration ← previously discussed
       • #3 Selective replication ← this talk's focus
       • #4 Transmission lines ← previously discussed
     • #5 Combination outperforms isolated techniques

  7. Outline
     • Introduction
     • Characterization of CMP working sets
       • L2 requests
       • L2 cache capacity
       • Sharing behavior
       • L2 request locality
     • ASR: Adaptive Selective Replication
     • Cache block migration
     • TLC: Transmission Line Caches
     • Combination of techniques

  8. Characterizing CMP Working Sets
     • 8-processor CMP
       • 16 MB shared L2 cache, 64-byte block size
       • 64 KB L1 I & D caches
     • Profile L2 blocks during their on-chip lifetime
     • Three L2 block sharing types
       • Single requestor — all requests by a single processor
       • Shared read-only — read-only requests by multiple processors
       • Shared read-write — read and write requests by multiple processors
     • Workloads
       • Commercial: apache, jbb, oltp, zeus
       • Scientific: (SpecOMP) apsi & art; (SPLASH) barnes & ocean

  9. Request Types
     [Chart: percent of L2 cache requests by sharing type — the majority of commercial workload requests are for shared blocks]

  10. Percent of L2 Cache Capacity
      [Chart: the majority of L2 capacity holds single-requestor blocks]

  11. Costs of Replication
      • Decreased effective cache capacity
        • Replicas are stored instead of unique blocks
        • Analyze the average number of sharers during a block's on-chip lifetime
      • Increased store latency
        • Stores must invalidate remote read-only copies
        • Run length [Eggers & Katz ISCA 88]: the average number of intervening remote reads between writes from the same processor, plus intervening reads between writes from different processors, measured over L2 requests (see the sketch below)
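      As a rough illustration of the run-length metric, here is a simplified C++ sketch that counts intervening reads between writes over a block's request stream. It ignores the same- vs. different-processor split the slide draws, and all names are hypothetical:

      ```cpp
      #include <vector>

      struct Access { int cpu; bool isWrite; };

      // Average number of reads that intervene between consecutive writes
      // in one block's L2 request stream (simplified run-length measure).
      double avgInterveningReads(const std::vector<Access>& stream) {
          long reads = 0, writes = 0, intervening = 0;
          for (const Access& a : stream) {
              if (a.isWrite) { intervening += reads; reads = 0; ++writes; }
              else           { ++reads; }
          }
          return writes ? double(intervening) / writes : 0.0;
      }
      ```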

  12. Sharing Behavior
      [Chart: breakdown of L2 requests — few intervening requests for the commercial workloads; blocks are widely shared across all workloads]

  13. Locality Graphs
      [Chart: example request-to-block distributions ranging from high locality through intermediate and low locality to no locality]

  14. Request-to-Block Distribution: Single-Requestor Blocks
      [Chart: lower locality]

  15. Request-to-Block Distribution: Shared Read-Only Blocks
      [Chart: high locality; y-axis is L2 cache MRU hit ratio]

  16. Request-to-Block Distribution: Shared Read-Write Blocks
      [Chart: intermediate locality]

  17. Workload Characterization: Summary
      • Commercial workloads show significant shared read-only activity
        • Majority of requests → 42-71%
        • Little capacity without replication → 9-21%
        • Highly shared → 3.0-4.5 average processors per block
        • High request locality → 3% of blocks account for 70% of requests
      • Shared read-only data is a great candidate for selective replication

  18. Outline
      • Introduction
      • Characterization of CMP working sets
      • ASR: Adaptive Selective Replication
        • Replication's effect on memory performance
        • SPR: Selective Probabilistic Replication
        • Monitoring and adapting to workload behavior
        • Evaluation
      • Cache block migration
      • TLC: Transmission Line Caches
      • Combination of techniques

  19. Replication and Memory Cycles
      Memory cycles / instruction = average cycles for L1 cache misses
        = (P_localL2 × L_localL2) + (P_remoteL2 × L_remoteL2) + (P_miss × L_miss)
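      As a concrete check of the model, a small C++ computation with hypothetical probabilities; the latencies echo figures used elsewhere in the talk (40+ cycles for a remote hit, 150-cycle memory latency), but are assumptions here:

      ```cpp
      #include <cstdio>

      int main() {
          // Probabilities that an L1 miss hits the local L2, hits a remote
          // L2, or misses on-chip entirely (values hypothetical).
          double p_local_l2  = 0.60, l_local_l2  = 12;   // local L2 hit latency
          double p_remote_l2 = 0.30, l_remote_l2 = 40;   // remote L2 hit latency
          double p_miss      = 0.10, l_miss      = 150;  // off-chip miss latency

          double cycles = p_local_l2  * l_local_l2
                        + p_remote_l2 * l_remote_l2
                        + p_miss      * l_miss;
          std::printf("average cycles per L1 miss: %.1f\n", cycles);  // 34.2
      }
      ```

      Replication shifts weight from P_remoteL2 to P_localL2 (lowering hit cycles) while raising P_miss (raising miss cycles), which is exactly the trade-off the next slides plot.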

  20. Replication Benefit: L2 Hit Cycles
      [Graph: L2 hit cycles vs. replication capacity — hit cycles fall as replication grows]

  21. Replication and Memory Cycles
      Memory cycles / instruction = average cycles for L1 cache misses
        = (P_localL2 × L_localL2) + (P_remoteL2 × L_remoteL2) + (P_miss × L_miss)

  22. Replication Cost: L2 Miss Cycles
      [Graph: L2 miss cycles vs. replication capacity — miss cycles rise as replication grows]

  23. Optimal Replication Effectiveness: Total Cycles
      [Graph: total cycle curve vs. replication capacity — the optimum lies where combined hit and miss cycles are minimized]

  24. Outline
      • Wires and CMP caches
      • Characterization of CMP working sets
      • ASR: Adaptive Selective Replication
        • Replication's effect on memory performance
        • SPR: Selective Probabilistic Replication
        • Monitoring and adapting to workload behavior
        • Evaluation
      • Cache block migration
      • TLC: Transmission Line Caches
      • Combination of techniques

  25. Identifying and Replicating Shared Read-Only Blocks
      • Minimal coherence impact
      • Per-cache-block identification — a heuristic, not perfect (see the sketch below)
      • Dirty bit
        • Indicates written data
        • Leverages a current bandwidth-reduction optimization
      • Shared bit
        • Indicates multiple sharers
        • Set for blocks with multiple requestors
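      A minimal sketch of the heuristic as a predicate over the two bits; the field and function names are illustrative, not the dissertation's:

      ```cpp
      // A block is treated as a replication candidate (shared read-only)
      // when its shared bit is set and its dirty bit is not. As noted above,
      // this is a heuristic and can misclassify blocks.
      struct L2Block {
          bool dirty;   // set once the block has been written
          bool shared;  // set once multiple processors have requested it
      };

      inline bool isSharedReadOnly(const L2Block& b) {
          return b.shared && !b.dirty;
      }
      ```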

  26. SPR: Selective Probabilistic Replication
      • Mechanism for selective replication
        • Controls duplication between the L2 caches of CMP-Private
      • Relaxed L2 inclusion property
        • L2 evictions do not force L1 evictions (non-exclusive cache hierarchy)
      • Ring writebacks
        • L1 writebacks pass clockwise between private L2 caches
        • Merge with other existing L2 copies
      • Probabilistically choose between (see the sketch below)
        • Local writeback → allow replication
        • Ring writeback → disallow replication
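      A minimal sketch of the probabilistic writeback choice; the level-to-probability table is an illustrative assumption (the dissertation's actual per-level probabilities may differ):

      ```cpp
      #include <random>

      // Replication level 0 never replicates (always ring writeback);
      // the top level always replicates locally. Probabilities are
      // placeholders for illustration only.
      static const double kReplicateProb[6] = {0.0, 0.05, 0.125, 0.25, 0.5, 1.0};

      // On an L1 writeback: true -> local writeback (replicate in the local
      // private L2); false -> pass the block clockwise around the ring.
      bool replicateLocally(int replicationLevel, std::mt19937& rng) {
          std::bernoulli_distribution coin(kReplicateProb[replicationLevel]);
          return coin(rng);
      }
      ```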

  27. SPR: Selective Probabilistic Replication
      [Diagram: 8 CPUs, each with L1 I/D caches and a private L2; the private L2s are connected in a ring for writebacks]

  28. SPR: Selective Probabilistic Replication
      [Chart: replication levels 0-5 map to increasing replication capacity on real workloads; the current level sets the probability of local replication]

  29. Outline
      • Introduction
      • Characterization of CMP working sets
      • ASR: Adaptive Selective Replication
        • Replication's effect on memory performance
        • SPR: Selective Probabilistic Replication
        • Implementing ASR
        • Evaluation
      • Cache block migration
      • TLC: Transmission Line Caches
      • Combination of techniques

  30. Implementing ASR
      • Four mechanisms estimate the cost-benefit deltas
        • Decrease-in-replication benefit
        • Increase-in-replication benefit
        • Decrease-in-replication cost
        • Increase-in-replication cost
      • Triggering a cost-benefit analysis

  31. ASR: Decrease-in-replication Benefit
      [Graph: L2 hit cycles vs. replication capacity, marking the current level and the next lower level]

  32. ASR: Decrease-in-replication Benefit
      • Goal
        • Determine the replication benefit decrease of the next lower level
      • Mechanism: current replica bit (see the sketch below)
        • One bit per L2 cache block
        • Set for replications of the current level; not set for replications of lower levels
        • Current-replica hits would be remote hits at the next lower level
      • Overhead
        • 1 bit × 256 K L2 blocks = 32 KB
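      A sketch of how the current-replica-bit hits could feed the benefit estimate; the names and cycle accounting are illustrative assumptions:

      ```cpp
      // Hits to blocks whose current-replica bit is set would have been
      // remote L2 hits one replication level lower, so their local-vs-remote
      // latency difference is the benefit the current level provides.
      struct DecreaseBenefitEstimator {
          long currentReplicaHits = 0;

          void onLocalL2Hit(bool currentReplicaBit) {
              if (currentReplicaBit) ++currentReplicaHits;
          }
          // Estimated hit cycles lost per interval if replication decreased.
          long decreaseBenefit(long remoteHitLat, long localHitLat) const {
              return currentReplicaHits * (remoteHitLat - localHitLat);
          }
      };
      ```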

  33. ASR: Increase-in-replication Benefit
      [Graph: L2 hit cycles vs. replication capacity, marking the current level and the next higher level]

  34. ASR: Increase-in-replication Benefit
      • Goal
        • Determine the replication benefit increase of the next higher level
      • Mechanism: Next Level Hit Buffers (NLHBs) — see the sketch below
        • 8-bit partial-tag buffer
        • Stores replicas of the next higher level
        • NLHB hits would be local L2 hits at the next higher level
      • Overhead
        • 8 bits × 16 K entries × 8 processors = 128 KB
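      A sketch of an NLHB as a direct-mapped array of 8-bit partial tags; the indexing and tag-extraction functions are hypothetical stand-ins, not the dissertation's:

      ```cpp
      #include <cstdint>
      #include <vector>

      // Tracks blocks that were denied replication but *would* be local
      // replicas at the next higher replication level.
      class NLHB {
          std::vector<uint8_t> tags_;
      public:
          NLHB() : tags_(16 * 1024, 0) {}

          void allocate(uint64_t blockAddr) {        // replication denied
              tags_[index(blockAddr)] = partialTag(blockAddr);
          }
          bool wouldHit(uint64_t blockAddr) const {  // remote hit that would
              return tags_[index(blockAddr)] ==      // be local one level up
                     partialTag(blockAddr);
          }
      private:
          static std::size_t index(uint64_t a) { return (a >> 6) % (16 * 1024); }
          static uint8_t partialTag(uint64_t a) { return uint8_t(a >> 20); }
      };
      ```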

  35. ASR: Decrease-in-replication Cost
      [Graph: L2 miss cycles vs. replication capacity, marking the current level and the next lower level]

  36. ASR: Decrease-in-replication Cost
      • Goal
        • Determine the replication cost decrease of the next lower level
      • Mechanism: Victim Tag Buffers (VTBs)
        • 16-bit partial tags
        • Store recently evicted blocks of the current replication level
        • VTB hits would be on-chip hits at the next lower level
      • Overhead
        • 16 bits × 1 K entries × 8 processors = 16 KB

  37. ASR: Increase-in-replication Cost
      [Graph: L2 miss cycles vs. replication capacity, marking the current level and the next higher level]

  38. ASR: Increase-in-replication Cost
      • Goal
        • Determine the replication cost increase of the next higher level
      • Mechanism: way and set counters [Suh et al. HPCA 2002] — see the sketch below
        • Identify soon-to-be-evicted blocks
        • 16-way pseudo-LRU, 256 set groups
        • Count on-chip hits that would go off-chip at the next higher level
      • Overhead
        • 255-bit pseudo-LRU tree × 8 processors = 255 B
      • Overall storage overhead: 212 KB, or 1.2% of total storage
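      A loose sketch of the way-counter idea: count hits by LRU stack position, since hits in the least-recently-used ways are the ones extra replicas would displace. The constants mirror the slide; the logic is an illustrative simplification of [Suh et al. HPCA 2002]:

      ```cpp
      // Per set group, count L2 hits by (pseudo-)LRU stack position.
      struct WayCounters {
          long hitsByStackPos[16] = {0};        // 16-way cache

          void onL2Hit(int lruStackPos) { ++hitsByStackPos[lruStackPos]; }

          // Hits to the ways that extra replicas would consume; these would
          // become off-chip misses at the next higher replication level.
          long increaseCost(int waysLostToReplicas) const {
              long cost = 0;
              for (int w = 16 - waysLostToReplicas; w < 16; ++w)
                  cost += hitsByStackPos[w];
              return cost;
          }
      };
      ```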

  39. ASR: Triggering a Cost-Benefit Analysis
      • Goal
        • Dynamically adapt to workload behavior
        • Avoid unnecessary replication-level changes
      • Mechanism
        • Evaluation trigger: local replications or NLHB allocations exceed 1 K
        • Replication change: four consecutive evaluations in the same direction

  40. ASR: Adaptive Algorithm
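      The slide's algorithm figure is not reproduced in this transcript. Below is a hedged sketch of the decision loop the preceding slides imply, combining the four cycle estimates from slides 31-38 with the four-evaluation hysteresis from slide 39; the structure and names are assumptions, not the dissertation's code:

      ```cpp
      enum class Decision { Increase, Decrease, Hold };

      struct ASRController {
          int level = 3;                 // current SPR replication level (0..5)
          int streak = 0;                // consecutive evaluations agreeing
          Decision last = Decision::Hold;

          // Called when the evaluation trigger fires. Inputs are the cycle
          // estimates produced by the four monitoring mechanisms.
          void evaluate(long increaseBenefit, long increaseCost,
                        long decreaseBenefit, long decreaseCost) {
              Decision d = Decision::Hold;
              if (increaseBenefit > increaseCost)       // higher level wins
                  d = Decision::Increase;
              else if (decreaseCost > decreaseBenefit)  // lower level wins
                  d = Decision::Decrease;

              streak = (d == last) ? streak + 1 : 1;
              last = d;
              if (streak < 4) return;   // require 4 consecutive agreements
              if (d == Decision::Increase && level < 5) ++level;
              if (d == Decision::Decrease && level > 0) --level;
              streak = 0;
          }
      };
      ```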

  41. Outline
      • Introduction
      • Characterization of CMP working sets
      • ASR: Adaptive Selective Replication
        • Replication's effect on memory performance
        • SPR: Selective Probabilistic Replication
        • Implementing ASR
        • Evaluation
      • Cache block migration
      • TLC: Transmission Line Caches
      • Combination of techniques

  42. Methodology
      • Full-system simulation
        • Simics
        • Wisconsin's GEMS timing simulator: out-of-order processor and memory system models
      • Workloads
        • Commercial: apache, jbb, oltp, zeus
        • Scientific: not shown here; in the dissertation

  43. System Parameters
      [Table: 8-core CMP, 45 nm technology]

  44. Replication Benefit, Cost, & Effectiveness Curves
      [Graphs: measured benefit and cost curves]

  45. Replication Benefit, Cost, & Effectiveness Curves
      [Graph: effectiveness curve; 4 MB cache, 150-cycle memory latency]

  46. ASR: Adapting to Workload Behavior
      [Graph: OLTP, all CPUs]

  47. ASR: Adapting to Workload Behavior
      [Graph: Apache, all CPUs]

  48. ASR: Adapting to Workload Behavior
      [Graph: Apache, CPU 0]

  49. ASR: Adapting to Workload Behavior
      [Graph: Apache, CPUs 1-7]

  50. Comparison of Replication Policies
      • SPR admits multiple possible policies; four shared read-only replication policies are evaluated
      • VR: Victim Replication — previously proposed [Zhang ISCA 05]
        • Disallows replicas from evicting shared owner blocks
      • NR: CMP-NuRapid — previously proposed [Chishti ISCA 05]
        • Replicates upon the second request
      • CC: Cooperative Caching — previously proposed [Chang ISCA 06]
        • Replaces replicas first; spills singlets to remote caches
        • Tunable parameter: 100%, 70%, 30%, 0%
      • VR, NR, and CC lack dynamic adaptation
      • ASR: Adaptive Selective Replication — my proposal
        • Monitors and adjusts to workload demand
