ECE8833 Polymorphous and Many-Core Computer Architecture
Lecture 6: Fair Caching Mechanisms for CMP

Presentation Transcript


  1. ECE8833 Polymorphous and Many-Core Computer Architecture Lecture 6 Fair Caching Mechanisms for CMP Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering

  2. Cache Sharing in CMP [Kim, Chandra, Solihin, PACT’04]
     [Figure: two processor cores, each with a private L1 cache, sharing one L2 cache]
     Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  3. Cache Sharing in CMP
     [Figure: thread t1 runs on Processor Core 1 and fills part of the shared L2]
     Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  4. Cache Sharing in CMP
     [Figure: thread t2 runs on Processor Core 2 and fills part of the shared L2]
     Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  5. Cache Sharing in CMP
     [Figure: t1 and t2 run together; t1 occupies most of the shared L2]
     t2’s throughput is significantly reduced due to unfair cache sharing.
     Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  6. Shared L2 Cache Space Contention Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  7. Impact of Unfair Cache Sharing
     [Figure: time-slice diagrams comparing uniprocessor scheduling (t1 t2 t3 t1 t4) with 2-core CMP scheduling of t1–t4 on P1/P2]
     • Uniprocessor scheduling vs. 2-core CMP scheduling
     • gzip will get more time slices than the others if it is set to run at a higher priority (yet it could still run slower than the others → priority inversion)
     • It could further slow down the other processes (starvation)
     • Thus the overall throughput is reduced (the goal of fair caching is instead a uniform slowdown)

  8. Stack Distance Profiling Algorithm [Qureshi+, MICRO-39]
     Hit counters, one per LRU stack position (MRU → LRU):
       Counter:  CTR Pos 0   CTR Pos 1   CTR Pos 2   CTR Pos 3
       Value:    30          20          15          10
     Misses = 25

  9. Stack Distance Profiling
     • One counter per cache way; C>A is the counter for misses
     • Shows the reuse frequency of each way in the cache
     • Can be used to predict the miss count for any associativity smaller than A
     • Misses for a 2-way cache for gzip = C>A + Σ Ci, where i = 3 to 8
     • art does not need all of its space, as its temporal locality is likely poor
     • If art’s space is halved and given to gzip, what happens?
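A minimal sketch in C of how these counters are used, assuming an A-way profile where ctr[i] counts hits at stack-distance position i and ctr_miss counts the accesses deeper than A (names and structure are illustrative, not from the paper):

    #include <stdint.h>

    #define A 8                    /* profiled associativity */

    typedef struct {
        uint64_t ctr[A];           /* hits per stack-distance position (0 = MRU) */
        uint64_t ctr_miss;         /* C>A: accesses with stack distance > A */
    } sdp_t;

    /* Predicted misses if the thread were given only 'ways' ways:
     * because LRU is a stack algorithm, every hit at a position
     * >= 'ways' would have been a miss in the smaller cache. */
    uint64_t predicted_misses(const sdp_t *p, int ways)
    {
        uint64_t m = p->ctr_miss;
        for (int i = ways; i < A; i++)
            m += p->ctr[i];
        return m;
    }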

  10. Fairness Metrics [Kim et al. PACT’04]
      • Uniform slowdown: Tshared(i) / Talone(i), where Talone(i) is the execution time of ti when it runs alone.
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  11. Fairness Metrics [Kim et al. PACT’04]
      • Uniform slowdown: Tshared(i) / Talone(i), where Tshared(i) is the execution time of ti when it shares the cache with others.
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  12. Fairness Metrics [Kim et al. PACT’04]
      • Uniform slowdown: ideally Tshared(i)/Talone(i) is equal for every thread i
      • We want to minimize the spread of these slowdown ratios across threads
      • Ideally: since execution times are hard to track online, try to equalize the ratio of miss increase of each thread, Xi = Missshared(i) / Missalone(i)
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  13. Fairness Metrics [Kim et al. PACT’04]
      • Uniform slowdown
      • We want to minimize M = Σi,j |Xi − Xj|
      • Ideally M = 0, i.e. Xi = Xj for every pair of threads
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
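A hedged sketch of how such a metric could be computed in C, using miss rates as the slowdown proxy; the pairwise-sum form follows the M3-style metrics of Kim et al., while the function name and fixed thread bound are illustrative:

    #include <math.h>

    /* X_i = MissRate_shared(i) / MissRate_alone(i); the metric sums
     * |X_i - X_j| over all thread pairs. 0 means perfectly fair. */
    double fairness_metric(const double *mr_shared, const double *mr_alone, int n)
    {
        double x[16];                       /* assumes n <= 16 threads */
        for (int i = 0; i < n; i++)
            x[i] = mr_shared[i] / mr_alone[i];

        double m = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                m += fabs(x[i] - x[j]);
        return m;
    }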

  14. Partitionable Cache Hardware
      • Modified LRU cache replacement policy (G. E. Suh et al., HPCA 2002)
      • Per-thread counters track the current vs. target partition sizes
      [Figure: current partition P1: 448B / P2: 576B, target partition P1: 384B / P2: 640B; on a P2 miss, a P1 LRU line is evicted so P2 grows toward its target]
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  15. Partitionable Cache Hardware
      • Modified LRU cache replacement policy (G. E. Suh et al., HPCA 2002)
      [Figure: after the replacement, the current partition (P1: 384B, P2: 640B) matches the target partition]
      • Partition granularity could be as coarse as one entire cache way
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
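A rough sketch, in C, of how this modified LRU replacement could enforce the partition; the helper functions and byte-granularity bookkeeping are assumptions for illustration, not taken from the paper:

    #define LINE_SIZE 64           /* assumed line size in bytes */

    typedef struct {
        unsigned current;          /* bytes currently held by the thread */
        unsigned target;           /* target partition size for the thread */
    } alloc_t;

    /* On a miss by 'thread' in 'set': if the thread is under its target,
     * evict another thread's LRU line so the current partition drifts
     * toward the target; otherwise replace within the thread's own lines. */
    int pick_victim(alloc_t *alloc, int thread, int set)
    {
        if (alloc[thread].current < alloc[thread].target) {
            int victim = lru_line_of_other_threads(set, thread); /* assumed helper */
            alloc[line_owner(victim)].current -= LINE_SIZE;      /* assumed helper */
            alloc[thread].current += LINE_SIZE;
            return victim;
        }
        return lru_line_of_thread(set, thread);                  /* assumed helper */
    }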

  16. Dynamic Fair Caching Algorithm (Ex: optimizing the M3 metric)
      • MissRatealone counters (P1, P2): miss rates of each process running alone, obtained from stack distance profiling
      • MissRateshared counters (P1, P2): dynamic miss rates while running with a shared cache
      • Target Partition counters (P1, P2): target partition size per process
      • Repartitioning interval: 10K accesses was found to be the best
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  17. Dynamic Fair Caching Algorithm — 1st interval
      • MissRatealone: P1: 20%, P2: 5%
      • Target Partition: P1: 256KB, P2: 256KB
      • Measured MissRateshared: P1: 20%, P2: 15%
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  18. Dynamic Fair Caching Algorithm — Repartition!
      • Evaluate M3: P1: 20% / 20% = 1.0; P2: 15% / 5% = 3.0
      • P2 suffers the larger miss increase, so shift capacity to P2
      • Target Partition: P1: 256KB → 192KB, P2: 256KB → 320KB
      • Partition granularity: 64KB
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  19. Dynamic Fair Caching Algorithm — 2nd interval
      • MissRatealone: P1: 20%, P2: 5%
      • Target Partition: P1: 192KB, P2: 320KB
      • MissRateshared from the 1st interval: P1: 20%, P2: 15%; a new measurement is collected during the 2nd interval
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  20. Dynamic Fair Caching Algorithm — Repartition!
      • New MissRateshared: P1: 20%, P2: 10%
      • Evaluate M3: P1: 20% / 20% = 1.0; P2: 10% / 5% = 2.0
      • P2 still suffers the larger miss increase, so shift another 64KB to P2
      • Target Partition: P1: 192KB → 128KB, P2: 320KB → 384KB
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  21. Dynamic Fair Caching Algorithm — 3rd interval
      • MissRatealone: P1: 20%, P2: 5%
      • Target Partition: P1: 128KB, P2: 384KB
      • New measured MissRateshared: P1: 25% (up from 20%), P2: 9% (down from 10%)
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU

  22. Dynamic Fair Caching Algorithm — Repartition with rollback
      • Do rollback if Δ < Trollback, where Δ = MRold − MRnew
      • P2’s miss-rate improvement (Δ = 10% − 9%) falls below the threshold, so roll the partition back
      • Target Partition: P1: 128KB → 192KB, P2: 384KB → 320KB
      • The best Trollback threshold was found to be 20%
      Slide courtesy: Seongbeom Kim, D. Chandra and Y. Solihin @NCSU
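The repartitioning loop sketched over slides 16–22 can be summarized in C. This is a simplified two-thread version: the constants follow the slides, but the names are illustrative and the rollback test (comparing the grown thread's miss-rate improvement against a relative 20% threshold) is one plausible reading of the slide:

    #define GRANULARITY_KB 64         /* repartitioning step: 64KB */
    #define T_ROLLBACK     0.20       /* best threshold per the slides: 20% */

    /* Shift one 64KB unit toward the thread with the larger miss increase
     * X_i = MissRate_shared(i) / MissRate_alone(i). */
    void repartition(const double mr_shared[2], const double mr_alone[2],
                     unsigned target_kb[2])
    {
        double x0 = mr_shared[0] / mr_alone[0];
        double x1 = mr_shared[1] / mr_alone[1];
        int grow   = (x0 > x1) ? 0 : 1;   /* thread to receive capacity */
        int shrink = 1 - grow;
        target_kb[grow]   += GRANULARITY_KB;
        target_kb[shrink] -= GRANULARITY_KB;
    }

    /* Roll back the previous step if it barely helped the grown thread:
     * delta = MR_old - MR_new must exceed the rollback threshold. */
    int should_rollback(double mr_old, double mr_new)
    {
        return (mr_old - mr_new) < T_ROLLBACK * mr_old;
    }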

  23. Generic Repartitioning Algorithm
      • Pick the threads with the largest and smallest slowdown ratios as a pair for repartitioning
      • Repeat for all candidate processes

  24. Utility-Based Cache Partitioning (UCP)

  25. Running Processes on Dual-Core [Qureshi & Patt, MICRO-39]
      [Figure: misses vs. number of ways given (1 to 16) for equake and vpr]
      • LRU: in real runs, on average 7 ways were allocated to equake and 9 to vpr
      • UTIL: how much you use (in a set) is how much you will get
      • Ideally, 3 ways to equake and 13 to vpr

  26. Defining Utility
      • Utility Uab = Misses with a ways − Misses with b ways
      [Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, showing low-utility, high-utility, and saturating-utility curves]
      Slide courtesy: Moin Qureshi, MICRO-39

  27. Framework for UCP
      [Figure: Core1 and Core2, each with private I$/D$, share an L2; a UMON per core (UMON1, UMON2) feeds the partitioning algorithm (PA); main memory sits below]
      Three components:
      • Utility monitors (UMON), one per core
      • Partitioning algorithm (PA)
      • Replacement support to enforce partitions
      Slide courtesy: Moin Qureshi, MICRO-39

  28. Utility Monitors (UMON)
      • For each core, simulate the LRU policy using an Auxiliary Tag Directory (ATD)
      • UMON-global: one way-counter for all sets
      • Hit counters in the ATD count hits per recency position
      [Figure: counters H0 (MRU) through H15 (LRU) attached to an ATD covering sets A–H]
      • LRU is a stack algorithm, so hit counts give utility directly; e.g., hits(2 ways) = H0 + H1

  29. Utility Monitors (UMON)
      • Extra tags incur hardware and power overhead
      • Dynamic Set Sampling (DSS) reduces the overhead [Qureshi et al. ISCA’06]
      [Figure: the full ATD with per-position hit counters, before sampling]

  30. Utility Monitors (UMON)
      • Extra tags incur hardware and power overhead
      • DSS reduces the overhead [Qureshi et al. ISCA’06]
      • 32 sets are sufficient, based on Chebyshev’s inequality
      • Simple static sampling (every 32nd set) is used in the paper
      • Storage < 2KB per UMON (about 0.17% of the L2)
      [Figure: the sampled UMON (DSS) keeps only a subset of the ATD sets, e.g. B, E, F, rather than all of A–H]

  31. Partitioning Algorithm (PA)
      • Evaluate all possible partitions and select the best
      • With a ways to core1 and (16−a) ways to core2:
        Hitscore1 = (H0 + H1 + … + Ha−1) — from UMON1
        Hitscore2 = (H0 + H1 + … + H(16−a)−1) — from UMON2
      • Select the a that maximizes (Hitscore1 + Hitscore2)
      • Partitioning is done once every 5 million cycles
      • After each partitioning interval, the hit counters in all UMONs are halved to retain some past information
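In C, the exhaustive dual-core search reads roughly as follows; h1/h2 stand for the per-position hit counters from UMON1 and UMON2, and the names are illustrative:

    #include <stdint.h>

    #define WAYS 16

    /* Return the number of ways 'a' for core1 (core2 gets WAYS - a)
     * that maximizes the combined hit count predicted by the UMONs. */
    int best_partition(const uint64_t h1[WAYS], const uint64_t h2[WAYS])
    {
        uint64_t best_hits = 0;
        int best_a = 1;
        for (int a = 1; a < WAYS; a++) {
            uint64_t hits = 0;
            for (int i = 0; i < a; i++)        hits += h1[i];  /* Hitscore1 */
            for (int i = 0; i < WAYS - a; i++) hits += h2[i];  /* Hitscore2 */
            if (hits > best_hits) { best_hits = hits; best_a = a; }
        }
        return best_a;
    }

    /* After each 5M-cycle interval, halve every counter to age history. */
    void age_counters(uint64_t h[WAYS])
    {
        for (int i = 0; i < WAYS; i++)
            h[i] >>= 1;
    }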

  32. Replacement Policy to Reach the Desired Partition
      • Use way partitioning [Suh+ HPCA’02, Iyer ICS’04]
      • Each line contains core-id bits
      • On a miss, count the ways occupied in the set by the miss-causing app
      • If ways_occupied < ways_given, the victim is the LRU line of the other app; otherwise the victim is the LRU line of the miss-causing app
      • Binary decision for dual-core (in this paper)
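A compact sketch of that binary decision in C; the per-set occupancy count comes from the core-id bits, and the helper functions are assumptions for illustration:

    /* Choose the victim way on a miss by 'core' in 'set'. */
    int choose_victim(int set, int core, const int ways_given[2])
    {
        int occupied = count_ways_owned_by(set, core);   /* via core-id bits; assumed helper */
        if (occupied < ways_given[core])
            return lru_way_of_other_core(set, core);     /* grow toward quota */
        return lru_way_of_core(set, core);               /* replace own line */
    }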

  33. UCP Performance (Weighted Speedup) UCP improves average weighted speedup by 11% (Dual Core)

  34. UCP Performance (Throughput) UCP improves average throughput by 17%

  35. Dynamic Insertion Policy

  36. Conventional LRU
      [Figure: an incoming block is inserted at the MRU position]
      Slide Source: Yuejian Xie

  37. Conventional LRU
      [Figure: the inserted block drifts from MRU toward LRU]
      It occupies one cache block for a long time with no benefit!
      Slide Source: Yuejian Xie

  38. LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]
      [Figure: the incoming block is inserted at the LRU position instead]
      Slide Source: Yuejian Xie

  39. LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]
      [Figure: MRU→LRU stack after LRU insertion]
      Adapted Slide from Yuejian Xie

  40. LIP: LRU Insertion Policy [Qureshi et al. ISCA’07]
      LIP is not entirely new: Intel tried it in 1998 when designing “Timna” (which integrated the CPU and a graphics accelerator sharing the L2)
      Slide Source: Yuejian Xie

  41. BIP: Bimodal Insertion Policy [Qureshi et al. ISCA’07]
      • LIP may never age out older lines
      • BIP infrequently inserts lines at the MRU position
      • Let ε = bimodal throttle parameter:
        if (rand() < ε) insert at MRU position (as in LRU replacement); else insert at LRU position
      • Promote to MRU if reused
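A runnable rendering of that pseudocode in C. Here rand() stands in for the cheap pseudo-random source real hardware would use, ε = 1/32 is an example value rather than one prescribed by the slide, and insert_at_mru/insert_at_lru are assumed helpers:

    #include <stdlib.h>

    #define EPSILON (1.0 / 32.0)      /* bimodal throttle parameter (example) */

    /* On a fill, pick the insertion position bimodally. */
    void bip_insert(int set, int way)
    {
        if ((double)rand() / RAND_MAX < EPSILON)
            insert_at_mru(set, way);  /* rare: behave like plain LRU */
        else
            insert_at_lru(set, way);  /* common: LIP-style insertion */
    }
    /* On a subsequent hit, the block is promoted to MRU as usual. */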

  42. DIP: Dynamic Insertion Policy [Qureshi et al. ISCA’07]
      [Figure: DIP chooses between LRU and BIP; BIP itself mixes LRU insertion (probability ε) with LIP insertion (probability 1−ε)]
      • Two types of workloads: LRU-friendly or BIP-friendly
      • DIP can be implemented by: monitoring both policies (LRU and BIP), choosing the best-performing one, and applying it to the cache
      • Need a cost-effective implementation → “Set Dueling”

  43. Set Dueling for DIP [Qureshi et al. ISCA’07]
      Divide the cache in three:
      • Dedicated LRU sets
      • Dedicated BIP sets
      • Follower sets (use the winner of LRU vs. BIP)
      An n-bit saturating counter tracks the duel: misses to the LRU sets increment it, misses to the BIP sets decrement it.
      The counter decides the policy for the follower sets:
      • MSB = 0: use LRU
      • MSB = 1: use BIP
      A single counter thus implements monitor → choose → apply.
      Slide Source: Moin Qureshi
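The counter logic is small enough to show in full. A sketch in C, assuming predicates that map set indices to the three groups (the counter width is an example; the paper leaves n as a parameter):

    #define PSEL_BITS 10
    #define PSEL_MAX  ((1 << PSEL_BITS) - 1)

    static int psel = PSEL_MAX / 2;       /* n-bit saturating counter */

    /* Update the duel on every miss to a dedicated set. */
    void set_dueling_on_miss(int set)
    {
        if (is_dedicated_lru(set) && psel < PSEL_MAX)  psel++;  /* assumed helper */
        else if (is_dedicated_bip(set) && psel > 0)    psel--;  /* assumed helper */
    }

    /* Follower sets consult the MSB: 0 -> use LRU, 1 -> use BIP. */
    int follower_uses_bip(void)
    {
        return (psel >> (PSEL_BITS - 1)) & 1;
    }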

  44. Promotion/Insertion Pseudo Partitioning

  45. PIPP [Xie & Loh ISCA’09]
      • What’s PIPP? Promotion/Insertion Pseudo Partitioning
      • Achieves both capacity management (as in UCP) and dead-time management (as in DIP)
      • Eviction: the LRU block is the victim
      • Insertion: the new block goes the core’s quota worth of positions away from LRU
      [Figure: with a target allocation of 3, the new block is inserted 3 positions from LRU]
      • Promotion: on a hit, move toward MRU by only one position
      Slide Source: Yuejian Xie
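The three rules fit in a few lines of C. This sketch assumes a set ordered from MRU (position 0) to LRU (position W−1), with illustrative helper functions for the stack manipulation:

    #define W 16                           /* associativity */

    /* Miss by 'core': evict the LRU block, then insert the new block
     * quota[core] positions away from LRU. */
    void pipp_on_miss(int set, int core, const int quota[])
    {
        evict_position(set, W - 1);                      /* assumed helper */
        insert_at_lru_distance(set, quota[core]);        /* assumed helper */
    }

    /* Hit at position 'pos': promote toward MRU by only one slot. */
    void pipp_on_hit(int set, int pos)
    {
        if (pos > 0)
            swap_positions(set, pos, pos - 1);           /* assumed helper */
    }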

  46. PIPP Example
      • Core0 quota: 5 blocks, Core1 quota: 3 blocks (Core0 owns the numbered blocks, Core1 the lettered ones)
      • Cache state (MRU → LRU): 1 A 2 3 4 B 5 C
      • Core1 requests D (miss): the LRU block C is evicted, and D is inserted at Core1’s quota, 3 positions from LRU
      Slide Source: Yuejian Xie

  47. PIPP Example
      • Cache state (MRU → LRU): 1 A 2 3 4 D B 5
      • Core0 requests 6 (miss): the LRU block 5 is evicted, and 6 is inserted at Core0’s quota, 5 positions from LRU
      Slide Source: Yuejian Xie

  48. PIPP Example
      • Cache state (MRU → LRU): 1 A 2 6 3 4 D B
      • Core0 requests 7 (miss): the LRU block B is evicted, and 7 is inserted 5 positions from LRU
      Slide Source: Yuejian Xie

  49. PIPP Example
      • Cache state (MRU → LRU): 1 A 2 7 6 3 4 D
      • Core1 requests D (hit): D is promoted toward MRU by one position
      Slide Source: Yuejian Xie

  50. How PIPP Does Both Kinds of Management
      • Inserting closer to the LRU position handles dead time (as DIP does), while the quota-based insert positions pseudo-partition the capacity (as UCP does)
      Slide Source: Yuejian Xie
