
Improving Cache Management Policies Using Dynamic Reuse Distances

Improving Cache Management Policies Using Dynamic Reuse Distances. Nam Duong¹, Dali Zhao¹, Taesu Kim¹, Rosario Cammarota¹, Mateo Valero², Alexander V. Veidenbaum¹. ¹University of California, Irvine; ²Universitat Politecnica de Catalunya and Barcelona Supercomputing Center.





Presentation Transcript


  1. Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong¹, Dali Zhao¹, Taesu Kim¹, Rosario Cammarota¹, Mateo Valero², Alexander V. Veidenbaum¹ (¹University of California, Irvine; ²Universitat Politecnica de Catalunya and Barcelona Supercomputing Center)

  2. Cache Management • Cache management has been a hot research topic • [Diagram: a taxonomy of cache management work — Single-core (Replacement: LRU, NRU, EELRU, DIP, RRIP, …; Bypass: SPD, …) and Shared-cache (Partitioning: UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, …), with PDP contributing to each branch]

  3. Overview • Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution • Introduced a new concept, Protecting Distance (PD), which is shown to achieve such a balance • Developed single- and multi-core hit rate models as a function of PD, cache configuration and program behavior • Models are used to dynamically compute the best PD • Showed that PD-based cache management policies improve performance for both single- and multi-core systems

  4. Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation

  5. Definitions • The (line) reuse distance (RD): The number of accesses to the same cache set between two accesses to the same line • This metric is directly related to hit rate • The reuse distance distribution (RDD) • A distribution of observed reuse distances • A program signature for a given cache configuration • [Figure: RDDs of representative benchmarks; X-axis: the RD (< 256)]
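The RDD defined above can be computed directly from an address trace. The following sketch (not from the talk; function names and parameters are illustrative) counts, for each access, the number of intervening accesses to the same set since the previous access to the same line:

```python
from collections import defaultdict

def reuse_distance_distribution(trace, num_sets=1, block_bits=6):
    """Per-set reuse distance distribution (RDD) of an address trace:
    for each access, the RD is the number of accesses to the same set
    between this access and the previous access to the same line."""
    set_access_count = defaultdict(int)   # accesses seen so far, per set
    last_seen = {}                        # line -> set counter at its last access
    rdd = defaultdict(int)                # reuse distance -> count
    for addr in trace:
        line = addr >> block_bits         # cache line address
        s = line % num_sets               # set index
        if line in last_seen:
            rd = set_access_count[s] - last_seen[line]
            rdd[rd] += 1
        set_access_count[s] += 1
        last_seen[line] = set_access_count[s]
    return dict(rdd)
```

For example, the trace A, B, A (lines 0, 1, 0 in one set) yields one observed reuse distance of 1, since one access intervenes between the two accesses to A.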

  6. Future Behavior Prediction • Cache management policies use past reference behavior to predict future accesses • Prediction accuracy is critical • Prediction in some of the prior policies • LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity) • Early eviction LRU (EELRU): Counts evictions in two non-LRU regions (early/late) to predict a line to evict • RRIP: Predicts if a line will be reused in a near, long, or distant future

  7. Balancing Reuse and Cache Pollution • Key to good performance (high hit rate) • Cache lines must be reused as much as possible before eviction • AND must be evicted soon after the “last” reuse to give space to new lines • The former can be achieved by using the reuse distance and actively preventing eviction • “Protecting” a line from eviction • The latter can be achieved by evicting when not reused within this distance • There is an optimal reuse distance balancing the two • It is called a Protecting Distance (PD)

  8. Example: 436.CactusADM • A majority of lines are reused at 64 or fewer accesses • There are multiple peaks at different reuse distances • Reuse maximized if lines are kept in the cache for 64 accesses • Lines may not be reused if evicted before that • Lines kept beyond that are likely to pollute cache • Assume that no lines are kept longer than a given RD

  9. The Protecting Distance (PD) • A distance at which a majority of lines are covered • A single value for all sets • Predicted based on the current RDD • Questions to answer/solve • Why does using the PD achieve the balance? • How to dynamically find the PD for an application and a cache configuration? • How to build the PD-based management policies?

  10. Outline • The concept of Protecting Distance • Single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation

  11. The Single-core PDP • A cache tag contains a line’s remaining PD (RPD) • A line can be evicted when its RPD = 0 • The RPD of an inserted or promoted line is set to the predicted PD • RPDs of the other lines in the set are decremented • A line is promoted on a hit • Example: a 4-way cache, predicted PD = 7; RPDs in a set before a hit to the second line: [1, 4, 6, 3]; after: [0, 6, 5, 2]
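The hit-path update can be sketched as follows (an illustrative reconstruction, not the authors' code). One detail is inferred from the slide's example rather than stated: with PD = 7, the promoted line ends at RPD = 6, which suggests the current access is also counted against the promoted line:

```python
def pdp_access_hit(rpds, hit_way, pd):
    """Update one set's remaining protecting distances (RPDs) on a hit:
    the hit line is promoted (RPD reset toward the predicted PD) and
    every other line's RPD is decremented, saturating at 0."""
    new = [max(r - 1, 0) for r in rpds]
    # The slide's example (PD = 7, before [1,4,6,3], after [0,6,5,2])
    # suggests the promoted line's RPD also counts the current access.
    new[hit_way] = pd - 1
    return new
```

Running this on the slide's example, `pdp_access_hit([1, 4, 6, 3], 1, 7)` reproduces `[0, 6, 5, 2]`.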

  12. The Single-core PDP (Cont.) • Selecting a victim on a miss • A line with RPD = 0 can be replaced • Two cases when all RPDs > 0 (no unprotected lines) • Caches without bypass (inclusive): • Unused lines are less likely to be reused than reused lines • Replace the unused line with the highest RPD first • No unused line: replace the line with the highest RPD • Caches with bypass (non-inclusive): bypass the new line • [Figure: example 4-way sets illustrating each victim-selection case]
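The victim-selection cases above can be sketched in a few lines (illustrative names and interfaces; `reused` stands in for a per-line "has hit since insertion" bit):

```python
BYPASS = -1  # sentinel: do not insert the new line

def pdp_select_victim(rpds, reused, allow_bypass):
    """Choose a victim way in one set under the PDP policy sketched on
    the slide. rpds: remaining PDs per way; reused[w]: True if way w
    has hit since insertion. Returns a way index, or BYPASS."""
    # Case 1: any unprotected line (RPD == 0) can be replaced.
    for way, r in enumerate(rpds):
        if r == 0:
            return way
    # Case 2: all lines still protected.
    if allow_bypass:
        return BYPASS  # non-inclusive cache: bypass the new line
    # Inclusive cache: prefer an unused (never-reused) line with the
    # highest RPD; if all lines were reused, take the highest RPD.
    unused = [w for w in range(len(rpds)) if not reused[w]]
    candidates = unused if unused else list(range(len(rpds)))
    return max(candidates, key=lambda w: rpds[w])
```

For instance, with RPDs [1, 4, 6, 3] and only way 1 unused, an inclusive cache evicts way 1; if every line was reused, it evicts way 2 (highest RPD); a non-inclusive cache bypasses instead.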

  13. Evaluation of the Static PDP • Static PDP: uses the best static PD for each benchmark • PD < 256 • SPDP-NB: Static PDP with replacement only • SPDP-B: Static PDP with replacement and bypass • Performance: in general, DRRIP < SPDP-NB < SPDP-B • 436.cactusADM: a 10% additional miss reduction • The two static PDP policies have similar performance • 483.xalancbmk: 3 different execution windows show different behavior under SPDP-B

  14. 436.cactusADM: Explaining the Performance Difference • How do the evicted lines occupy the cache? • DRRIP: • Early-evicted lines: 75% of accesses, but occupy only 4% of the cache • Late-evicted lines: 2% of accesses, but occupy 8% of the cache → pollution • SPDP-NB: Early- and late-evicted lines: 42% of accesses, but occupy only 4% • SPDP-B: Late-evicted lines: 1% of accesses, occupy 3% of the cache → yielding cache space to useful lines • PDP suffers less pollution from long-RD lines in the cache than RRIP

  15. Case Study: 483.xalancbmk • The best PD differs across execution windows • And across programs • Need a dynamic policy that finds the best PD • Need a model to drive the search • There is a close relationship between the hit rate, the PD, and the RDD

  16. A Hit Rate Model for a Non-inclusive Cache • The model estimates the hit rate E as a function of dp and the RDD • {Ni} (accesses with reuse distance i) and Nt (total accesses): the RDD • dp: the protecting distance • de: experimentally set to W (W: cache associativity) • The model is used to find the PD that maximizes the hit rate
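The slide names the model's inputs but not its equation. A hedged reconstruction of its spirit: accesses with RD ≤ dp hit and occupy the cache for roughly i accesses each, while the remaining accesses miss after being protected for dp accesses plus roughly de more, so E is hits per unit of occupied cache-time. The exact formula in the paper may differ; this sketch only illustrates the dp-maximization step:

```python
def expected_hit_rate(rdd, nt, dp, de):
    """Sketch of a hit-rate model E(dp): accesses with reuse distance
    i <= dp hit and occupy the cache for ~i accesses; the rest are
    protected for dp accesses and evictable after ~de more, missing.
    rdd: {reuse distance i: count Ni}; nt: total accesses Nt."""
    hits = sum(n for i, n in rdd.items() if i <= dp)
    occupancy = sum(i * n for i, n in rdd.items() if i <= dp)
    occupancy += (dp + de) * (nt - hits)  # protected-but-missing lines
    return hits / occupancy if occupancy else 0.0

def best_pd(rdd, nt, de, max_pd=256):
    """Find the dp < max_pd that maximizes the modeled hit rate."""
    return max(range(1, max_pd),
               key=lambda dp: expected_hit_rate(rdd, nt, dp, de))
```

With an RDD peaked at RD = 4 (90% of accesses) and a minor peak at RD = 100, the model prefers dp = 4: protecting past the far peak would capture 10% more hits but inflate every line's occupancy.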

  17. PDP Cache Organization • [Diagram: the LLC between the higher-level cache and main memory, augmented with an RD Sampler, RD Counter Array, and PD Compute Logic] • The RD Sampler tracks accesses to several cache sets • It sits in the L2 miss/writeback stream; the sampling rate can be reduced • It measures the reuse distance of a new access • The RD Counter Array collects the number of accesses at RD = i, and Nt • To reduce overhead, each counter covers a range of RDs • The PD Compute Logic finds the PD that maximizes E • The computed PD is used in the next interval (0.5M LLC accesses) • Reasonable hardware overhead: 2 or 3 bits per tag to store the RPD
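The ranged-counter trick mentioned above can be sketched as follows (class and parameter names are illustrative, not from the talk): each hardware counter covers a `step`-wide range of reuse distances, so a 256-entry distribution needs only 256/step counters:

```python
class RDCounterArray:
    """Sketch of the RD Counter Array: each counter covers a range
    ('step') of reuse distances to cut hardware cost; nt tracks Nt,
    the total number of sampled accesses."""
    def __init__(self, max_rd=256, step=8):
        self.step = step
        self.counters = [0] * (max_rd // step)
        self.nt = 0

    def record(self, rd):
        """Record one sampled reuse distance; RDs beyond max_rd
        contribute only to Nt."""
        self.nt += 1
        idx = rd // self.step
        if idx < len(self.counters):
            self.counters[idx] += 1
```

At the end of each interval, the PD Compute Logic would read these counters (a coarsened RDD) and Nt, evaluate E over candidate PDs, and install the winner for the next interval.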

  18. PDP vs. Existing Policies • [Table: comparison with prior policies, marking where each was originally proposed] • EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance • However, its lines are not always guaranteed to be protected • [1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: Simple and effective adaptive page replacement. In SIGMETRICS’99 • [2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA’07 • [3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA’10 • [4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO’10

  19. Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation

  20. PD-based Shared Cache Partitioning • Each thread has its own PD (thread-aware) • Counter array replicated per thread • Sampler and compute logic shared • A thread’s PD determines its cache partition • Its lines occupy cache longer if its PD is large • The cache is implicitly partitioned per needs of each thread using thread PDs • The problem is to find a set of thread PDs that together maximize the hit rate

  21. Shared-Cache Hit Rate Model • Extends the single-core approach • Computes a vector <PD1, …, PDT> (T = number of threads) • Exhaustive search for the vector is not practical • A heuristic search algorithm finds a combination of threads’ RDD peaks that maximizes the modeled hit rate • The single-core model generates the top 3 peaks per thread • The complexity is O(T²) • See the paper for more detail
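One plausible realization of such a search (the talk does not give the algorithm, so the structure and interfaces below are assumptions): restrict each thread to its top RDD peaks and improve one thread's PD at a time against the shared-cache model, instead of enumerating all peak combinations:

```python
def search_thread_pds(peaks, hit_rate):
    """Hedged sketch of a heuristic search over per-thread RDD peaks.
    peaks[t]: a short candidate list (e.g. the top-3 RDD peaks) for
    thread t; hit_rate(pds): the shared-cache hit rate model
    (hypothetical interface). One pass per thread fixes the other
    threads' PDs and keeps the candidate maximizing the modeled hit
    rate, avoiding exhaustive search over all peak combinations."""
    pds = [p[0] for p in peaks]  # start from each thread's top peak
    improved = True
    while improved:
        improved = False
        for t in range(len(peaks)):
            best = max(peaks[t],
                       key=lambda c: hit_rate(pds[:t] + [c] + pds[t + 1:]))
            if best != pds[t]:
                pds[t] = best
                improved = True
    return pds
```

Each pass costs O(T) model evaluations per thread over a constant-size candidate list, which is consistent in spirit with the O(T²) complexity quoted on the slide.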

  22. Outline • The concept of Protecting Distance • The single-core PD-based replacement and bypass policy (PDP) • The multi-core PD-based management policies • Evaluation

  23. Evaluation Methodology • CMP$im simulator • Target cache: the LLC (replacement managed there)

  24. Evaluation Methodology (Cont.) • Benchmarks: SPEC CPU 2006 • Excluded benchmarks that did not stress the LLC • Single-core: • Compared to EELRU, SDP, DIP, DRRIP • Multi-core: • 4- and 16-core configurations, 80 workloads each • Workloads were generated by randomly combining benchmarks • Compared to UCP, PIPP, TA-DRRIP • Our policy: PDP-x, where x is the number of bits per cache line

  25. Single-core PDP • PDP-x, where x is the number of bits per cache line • Each benchmark is executed for 1B instructions • Best with 3 bits per line, but still better than prior work with 2 bits

  26. Adaptation to Program Phases • 5 benchmarks that demonstrate significant phase changes • Each benchmark is run for 5B instructions • [Figure: change of PD over time; X-axis: intervals of 1M LLC accesses]

  27. Adaptation to Program Phases (Cont.) • IPC improvement over DIP

  28. PD-based Cache Partitioning for 16 cores • Normalized to TA-DRRIP

  29. Hardware Overhead

  30. Other Results • Exploration of PDP cache parameters • Cache bypass fraction • Prefetch-aware PDP • PD-based cache management policy for 4-core

  31. Conclusions • Proposed the concept of Protecting Distance (PD) • Showed that it can be used to better balance reuse and cache pollution • Developed a hit rate model as a function of the PD, program behavior, and cache configuration • Proposed PD-based management policies for both single- and multi-core systems • PD-based policies outperform existing policies

  32. Thank You!

  33. Backup Slides • RDD, E and hit rate of all benchmarks

  34. RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks

  35. RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)

  36. RDDs, Modeled and Real Hit Rates of SPEC CPU 2006 Benchmarks (Cont.)
