
Decoupled Architecture for Data Prefetching

Decoupled Architecture for Data Prefetching. Jichuan Chang <chang@cs.wisc.edu>, Kai Xu <xuk@cs.wisc.edu>. Outline: Motivation; Design and Evaluation; Results; Conclusions. Motivation: Processor-memory performance gap. Prefetching helps, but it has overhead.


Presentation Transcript


  1. Decoupled Architecture for Data Prefetching. Jichuan Chang <chang@cs.wisc.edu>, Kai Xu <xuk@cs.wisc.edu>. CS752

  2. Outline
     • Motivation
     • Design and Evaluation
     • Results
     • Conclusions

  3. Motivation
     • Processor-memory performance gap
     • Prefetching helps, but it has overhead.
     • Transistors are cheap; will a coprocessor help?
     [Diagram labels: Main Processor, Prefetching CoProcessor, Info Flow, Cache, Prefetch Requests, Data, L1-L2 Internal Bus]

  4. Why a dedicated coprocessor?
     • Simple
       • It simplifies the design of the main processor.
     • Powerful
       • It can (hopefully) exploit complex algorithms;
       • It handles computation overhead (i.e. pattern computation, address computation).
     • Flexible
       • It can (hopefully) adapt to different situations;
       • It can implement different algorithms.
     • But are these true?

  5. The Generic Design
     [Diagram labels: Main Processor; Info Flow; Prefetching CoProcessor (ALU; Tables: RPT, PPW, CT, History, …); Prefetch Requests; Cache; Stream Buffer; Data Bus]
     • What? When? Where?

  6. Data Prefetching Techniques
     • Regular Access Prefetching
       • Tagged Next Block Lookahead [Smith 82]: exploits sequential access patterns
       • Stride Prefetching [Baer & Chen 91]: exploits stride access patterns
     • Dependency-based Prefetching [Roth, et al 98]: discovers linked-data-structure access patterns
     • Dead Block Correlation [Lai, et al 01]: history-based correlation prediction
     • Stream Buffer [Jouppi 90]: reduces cache pollution

  7. Simulation Settings
     • SimpleScalar v3.0
       • Modified sim-outorder to implement information sharing between the main processor (MP) and the prefetching coprocessor (PCP);
       • Modified the cache module to implement
         • prefetching schemes (between L1 and L2 cache),
         • prefetch queue (len = 16), bus sharing/contention,
         • stream buffer.
     • Memory Parameters
       • L1 Data Cache: 4KB, 32B line, 4-way associative;
       • L2 Cache: 64KB, 64B line, 4-way associative;
       • Stream buffer: 8 entries, fully associative, 1 cycle hit;
       • Hit latency (cycles): L1 = 1, L2 = 12, Mem = 70 (2*);
       • Pipelined bus: bus contention/latency are modeled.
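
To make the latency parameters concrete, a rough average-memory-access-time estimate might look like the sketch below. The hit latencies (1, 12, 70 cycles) come from the slide; the miss ratios are made-up placeholders, not results from the presentation. The model is a plain additive one that ignores bus contention, the stream buffer, and the "(2*)" annotation on the memory latency.

```python
# Latencies from the slide, in cycles.
L1_HIT = 1
L2_HIT = 12
MEM = 70


def amat(l1_miss_ratio, l2_miss_ratio):
    """Average memory access time for a two-level hierarchy (simple additive model)."""
    return L1_HIT + l1_miss_ratio * (L2_HIT + l2_miss_ratio * MEM)


# Hypothetical miss ratios, chosen only for illustration.
print(amat(0.10, 0.20))
```

Lowering either miss ratio through prefetching shrinks the second term, which is the effect the IPC and miss-ratio slides measure.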

  8. Benchmarks
     • From SPEC95
       • gcc
       • compress
       • swim
       • tomcatv
     • Microbenchmarks
       • Matrix multiplication (128 × 128 double)
       • Binary tree (1M nodes, similar to treeadd)

  9. Results (IPC)

  10. Results (Miss Ratio)

  11. Results (Prefetch Accuracy)

  12. L1-L2 Traffic Increase

  13. Results (Delay Tolerance)
     • How many cycles of delay can the PCP tolerate?
     • More delay means
       • Less useful prefetches (data can't get back before the demand references)
       • More pollution (due to outdated information)
       • Fewer prefetches (due to bus contention)
     • To avoid pollution, implement the prefetch queue as a circular buffer.
       • Overwrite outdated entries when the queue is full.
       • The major effect of larger delay will then be fewer prefetches.
     • Hard to model memory behavior in SimpleScalar
       • Predetermined latency, no wake-up, no MSHR.
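
The circular-buffer prefetch queue described on this slide could be sketched roughly as follows. This is an illustration, not the authors' SimpleScalar code: the class and method names are hypothetical, and "outdated" is interpreted here as "oldest pending entry". The default capacity of 16 is the queue length from the simulation settings.

```python
class PrefetchQueue:
    """Fixed-size circular buffer of pending prefetch addresses (hypothetical sketch)."""

    def __init__(self, capacity=16):  # len = 16 as in the simulation settings
        self.slots = [None] * capacity
        self.head = 0    # index of the oldest pending entry
        self.count = 0   # number of valid entries

    def push(self, addr):
        """Add a request; when full, overwrite the oldest (most outdated) entry."""
        capacity = len(self.slots)
        tail = (self.head + self.count) % capacity
        self.slots[tail] = addr
        if self.count == capacity:
            # Queue was full: the slot just written held the oldest entry.
            self.head = (self.head + 1) % capacity
        else:
            self.count += 1

    def pop(self):
        """Return the oldest pending request, or None if the queue is empty."""
        if self.count == 0:
            return None
        addr = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return addr
```

With this policy a slow bus never makes the queue grow; stale requests are simply dropped, which matches the slide's point that extra delay mainly shows up as fewer prefetches.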

  14. Delay tolerance
     • Preliminary result
       • For almost all schemes on all benchmarks, the PCP can tolerate 8 cycles of delay

  15. Can we integrate different schemes?
     • Different applications need different schemes
     • Brute force approach
       • Use both tagged and stride prefetching
       • Good speedup, but much more memory traffic.
     • Adapt prefetching policy dynamically?
       • Share the same hardware table
         • Using similar matching schemes
         • Hard to reconfigure/flush on context switches
       • Use separate tables
         • More hardware
         • Similar to tournament predictor (just a thought)
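
The slide only floats the tournament idea, so the following is a speculative sketch of what such a selector might look like, borrowing the 2-bit saturating counter used by tournament branch predictors. The class, its method names, and the training rule are all assumptions made for illustration.

```python
class TournamentSelector:
    """2-bit saturating counter that picks between two prefetch schemes (hypothetical)."""

    def __init__(self):
        self.counter = 2  # 0-1 favour "tagged", 2-3 favour "stride"

    def choose(self):
        """Return the scheme whose prefetch should be issued next."""
        return "stride" if self.counter >= 2 else "tagged"

    def update(self, tagged_correct, stride_correct):
        """Train on whether each scheme's last prediction covered the demand access."""
        if stride_correct and not tagged_correct:
            self.counter = min(3, self.counter + 1)
        elif tagged_correct and not stride_correct:
            self.counter = max(0, self.counter - 1)
        # If both were right or both were wrong, leave the counter unchanged.
```

Both schemes would still need their own tables to keep predicting in the background, which is the "more hardware" cost the slide notes.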

  16. Conclusions
     • PCP helps performance! (2-30% speedup)
     • PCP handles prefetching and can tolerate some delay.
     • Different schemes work for different applications
       • They require different information (from different places);
       • PCP should be placed close to the info source;
       • Not easy to integrate different schemes.
     • Limitations of our approach
       • PCP not fully utilized.
       • Relies on tables (caches/queues/buffers)
       • DBCP requires a large history table (7.6 M memory)!
     • Delay is critical to performance
       • It limits the complexity of prefetch schemes,
       • It also determines where to place PCP.

  17. Future Work
     • Evaluate more prefetching schemes
       • Dependency-based prefetching, etc.
     • PCP Running Ahead
       • Probably with the help of a trace cache;
       • To fully utilize PCP;
       • Needs checkpoint/rollback mechanisms.
     • CoProcessor to Support Other Functionalities
       • Branch prediction, power management.
     • PCP for Multiprocessor
       • Suitable for One-Block-Lookahead.
       • Need to change the cache-coherence (CC) protocol.

  18. Thank You! Questions?

  19. Backup Slides Gauges

  20. Tagged Prefetching
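
The body of this slide did not survive in the transcript. As a reminder of the general technique, tagged next-block lookahead can be sketched as below: each block carries a tag bit, and the next block is prefetched on a demand miss and on the first demand reference to a prefetched block. The sketch assumes an unbounded cache with no eviction, and the class and method names are hypothetical; only the 32-byte block size is taken from the simulation settings.

```python
BLOCK_SIZE = 32  # bytes; matches the L1 line size in the simulation settings


class TaggedPrefetcher:
    """Tagged next-block lookahead over an idealised, eviction-free cache."""

    def __init__(self):
        # block number -> tag bit (True once the block has been demand-referenced)
        self.cache = {}

    def access(self, addr):
        """Process one demand access; return the block number prefetched, or None."""
        block = addr // BLOCK_SIZE
        if block not in self.cache:
            # Demand miss: fetch the block, mark it referenced, prefetch the next one.
            self.cache[block] = True
            return self._prefetch(block + 1)
        if not self.cache[block]:
            # First demand reference to a prefetched block: extend the run.
            self.cache[block] = True
            return self._prefetch(block + 1)
        return None  # ordinary hit on an already-referenced block

    def _prefetch(self, block):
        if block in self.cache:
            return None  # already present, nothing to fetch
        self.cache[block] = False  # brought in by prefetch, not yet referenced
        return block
```

On a sequential scan this issues one prefetch per block, so after the first miss every later block arrives ahead of its demand reference.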

  21. Stride Prefetching
     • Reference Prediction Table (RPT)
       • Organized like a cache, indexed by PC
       • Each entry holds (data address, stride, state)
     • State Machine
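
The slide names the state machine without spelling it out. The sketch below uses the four-state scheme (initial, transient, steady, no-prediction) as it is commonly described for Baer & Chen's design; the transitions, the 64-entry table size, and all identifiers are assumptions that should be checked against the original paper, not details taken from the presentation.

```python
# (state, was the last prediction correct?) -> (next state, recompute stride?)
TRANSITIONS = {
    ("initial", True): ("steady", False),
    ("initial", False): ("transient", True),
    ("transient", True): ("steady", False),
    ("transient", False): ("no-pred", True),
    ("steady", True): ("steady", False),
    ("steady", False): ("initial", False),
    ("no-pred", True): ("transient", False),
    ("no-pred", False): ("no-pred", True),
}


class StridePrefetcher:
    """Direct-mapped table indexed by load PC (simplified sketch)."""

    def __init__(self, num_entries=64):  # table size is an arbitrary assumption
        self.num_entries = num_entries
        self.table = {}  # index -> [pc tag, previous address, stride, state]

    def access(self, pc, addr):
        """Update the table for a load at `pc`; return a prefetch address or None."""
        index = pc % self.num_entries
        entry = self.table.get(index)
        if entry is None or entry[0] != pc:
            # First sighting of this load (or a conflicting PC): allocate an entry.
            self.table[index] = [pc, addr, 0, "initial"]
            return None
        _, prev_addr, stride, state = entry
        correct = (addr == prev_addr + stride)
        next_state, update_stride = TRANSITIONS[(state, correct)]
        if update_stride:
            stride = addr - prev_addr
        self.table[index] = [pc, addr, stride, next_state]
        # Skip prefetching in the no-prediction state or when the stride is zero.
        if next_state != "no-pred" and stride != 0:
            return addr + stride
        return None
```

For a load that walks an array of 8-byte elements, the entry learns the stride on the second access and predicts the following address from then on.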

  22. Dependency-based Prefetching
     • Potential Producer Window
     • Correlation Table
     • One Step Ahead
     • Jump Pointer Generation/Maintenance
