Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue

David Tarjan*, Michael Boyer, and Kevin Skadron*, University of Virginia, Department of Computer Science. * Currently on internship/sabbatical at NVIDIA Research.

Presentation Transcript


  1. Federation: Repurposing Scalar Cores for Out-of-Order Instruction Issue • David Tarjan*, Michael Boyer, and Kevin Skadron* • University of Virginia, Department of Computer Science • * Currently on internship/sabbatical at NVIDIA Research

  2. Motivation [Figure: homogeneous, heterogeneous, and adaptive (Federation) chip organizations, drawn with multithreaded scalar in-order (IO) cores, 2-way out-of-order (OO) cores, and L2 cache tiles]

  3. Basic Insights • A multithreaded in-order core has many registers, which can be reused for a reorder buffer or active list • If cores are small, single-cycle communication between neighbors is feasible • Prior work on making large OOO cores feasible can be applied at the low end to make low-cost OOO possible

  4. In-order & Out-of-order Pipelines • In-order: Fetch, Decode, Execute, Mem, Writeback • Out-of-order: Fetch, Bpred, Decode, Allocate, Rename, Issue, Execute, Mem, Writeback, Commit

  5. Issue Queue Example [Figure: issue queue entries with Ready Bits, Subscriber Slot 1, and Subscriber Slot 2 columns; subscriber slots hold issue queue slot numbers such as IQ2 and IQ3] Huang et al., Energy-Efficient Hybrid Wakeup Logic, ISLPED 2002; Sassone et al., Matrix Scheduler Reloaded, ISCA 2007

  6. Simplified Load-Store Queue • Memory Alias Table (MAT) • No store forwarding • No conservative waiting on stores • Only detect memory order violations after they have occurred, and flush the pipeline when the offending instruction commits • Reference: Amir Roth, Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load Optimization, ISCA 2005

  7. MAT Example • Instructions: st 0x13, r5; ld r1, 0x13 • MAT entries 0 to 7 all hold 0

  8. MAT Example • ld r1, 0x13 (EXE): ld executes and increments counter • MAT entry 3 holds 1; all other entries hold 0

  9. MAT Example • st 0x13, r5 (COM): st commits and sets flag • MAT entry 3 holds 1 with its flag (!) set; all other entries hold 0

  10. MAT Example • ld r1, 0x13 (COM, Flush): ld commits, sees flag, and flushes pipeline • MAT unchanged: entry 3 holds 1 with its flag (!) set

  11. MAT Example • ld r1, 0x13: MAT is reset and execution resumes • MAT entries 0 to 7 all hold 0
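
The MAT walkthrough on slides 7 to 11 can be sketched in Python. This is a minimal model under stated assumptions, not the authors' hardware: the class and method names, the 8-entry table indexed by the low address bits, and the counter decrement on a clean load commit are illustrative choices that the slides do not specify.

```python
class MemoryAliasTable:
    """Toy model of the MAT example (names and sizes are hypothetical)."""

    def __init__(self, entries=8):
        self.counters = [0] * entries    # loads executed but not yet committed
        self.flags = [False] * entries   # set when a committing store finds such a load

    def _index(self, addr):
        # Assumption: entries are selected by the low address bits.
        return addr % len(self.counters)

    def load_execute(self, addr):
        # "ld executes and increments counter"
        self.counters[self._index(addr)] += 1

    def store_commit(self, addr):
        # "st commits and sets flag": a nonzero counter means some load to an
        # aliasing address already executed ahead of this store.
        i = self._index(addr)
        if self.counters[i] > 0:
            self.flags[i] = True

    def load_commit(self, addr):
        """Return True if the pipeline must be flushed."""
        i = self._index(addr)
        if self.flags[i]:
            self.reset()                 # "MAT is reset and execution resumes"
            return True
        self.counters[i] -= 1            # assumption: a clean commit releases the entry
        return False

    def reset(self):
        n = len(self.counters)
        self.counters = [0] * n
        self.flags = [False] * n


mat = MemoryAliasTable()
mat.load_execute(0x13)            # ld r1, 0x13 runs early
mat.store_commit(0x13)            # st 0x13, r5 commits afterwards
flush = mat.load_commit(0x13)     # ld commits and sees the flag
```

Because several addresses share one entry, a flagged entry may be a false positive. The flush is still safe, since it only costs re-execution.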

  12. Performance Impact

  13. Performance

  14. Energy Efficiency

  15. Area Efficiency

  16. Conclusions • Two in-order cores can be federated at run-time to form a 2-way OO core • Almost doubling the IPC of a throughput core is possible with very little extra hardware • Don’t want traditional OO structures because their performance comes at too high a price • Best combined area- and energy-efficiency

  17. Q & A

  18. Backup

  19. Core Fusion Data • Figure from Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors”, ISCA 2007

  20. Overall Results • Scalar in-order core is 8KB I/D, 256KB L2 • Base 2-way core has 16KB I and D-Caches, 256KB L2, 32 entry ROB, 16 entry issue queue, 16 entry LSQ, bimodal bpred • 4-way core is 32KB I/D, 2MB L2, 128 entry ROB, 32 IQ and LSQ, tournament bpred

  21. Branch Prediction • Use only a Next Line and Set (NLS) predictor, a bimodal predictor, and a Return Address Stack (RAS) • NLS is OK if the instruction working set is no larger than the I$ • A small bimodal predictor is OK for a small-window processor
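
For reference, a bimodal predictor is a table of 2-bit saturating counters indexed by the branch address. The sketch below shows the general technique only; the table size, the indexing by `pc % entries`, the initial counter value, and all names are assumptions and are not taken from the slides.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters (sizes and names are hypothetical)."""

    def __init__(self, entries=64):
        # Counter values 0-1 predict not-taken, 2-3 predict taken.
        # Assumption: start every counter at 1 (weakly not-taken).
        self.counters = [1] * entries

    def _index(self, pc):
        return pc % len(self.counters)

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


bp = BimodalPredictor()
bp.update(0x40, True)
bp.update(0x40, True)             # counter for this branch is now saturated at 3
prediction = bp.predict(0x40)     # predicts taken
```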

  22. Fetch • Two I$’s act as an I$ of twice the size and associativity (and random replacement) • More logic and buffers to capture two instructions • Extra cycle to route instructions from two I$’s to two decoders

  23. Decode • Cancel second instruction if first turns out to be branch • Extra cycle to route decoded instructions to new allocate stage

  24. Allocate • New logic and free lists to allocate ROB, IQ entries

  25. Rename • New table, since it has too many ports • One centralized rename table, not distributed • Has a separate table (or a field in each RAT entry) for the IQ-slot number of each register’s producer instruction (see our new issue queue)

  26. Issue • Uses a simple lookup table as wakeup structure, where instructions subscribe to their input instructions (explained in detail later) • Centralized, one IQ for the two cores
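
The subscription idea can be sketched as follows: at dispatch, a consumer looks up the IQ slot of each source register's producer and writes its own slot number into one of that producer's subscriber slots; when the producer issues, it wakes only the entries named there. This is a rough Python model, not the actual hardware. The class and method names are hypothetical, the set of outstanding producers stands in for the ready bits, and stalling dispatch when a producer's two subscriber slots are full is an assumption.

```python
class IssueQueue:
    """Toy subscription-based wakeup (all names are hypothetical)."""

    def __init__(self, subscriber_slots=2):
        self.subscriber_slots = subscriber_slots
        self.entries = {}        # IQ slot -> entry state
        self.producer_slot = {}  # register -> IQ slot of its in-flight producer

    def dispatch(self, slot, dest, srcs):
        """Insert an instruction; return False if dispatch has to stall."""
        if slot in self.entries:
            return False
        producers = {self.producer_slot[r] for r in srcs if r in self.producer_slot}
        # Assumption: stall when any producer has no free subscriber slot.
        if any(len(self.entries[p]["subscribers"]) >= self.subscriber_slots
               for p in producers):
            return False
        for p in producers:
            self.entries[p]["subscribers"].append(slot)
        self.entries[slot] = {"dest": dest, "waiting_on": producers, "subscribers": []}
        if dest is not None:
            self.producer_slot[dest] = slot
        return True

    def ready(self, slot):
        # Ready once no producer is outstanding (stands in for the ready bits).
        return not self.entries[slot]["waiting_on"]

    def issue(self, slot):
        """Issue a ready instruction and wake its subscribers by table lookup."""
        if not self.ready(slot):
            raise ValueError("instruction is not ready")
        entry = self.entries.pop(slot)
        for s in entry["subscribers"]:
            self.entries[s]["waiting_on"].discard(slot)
        if self.producer_slot.get(entry["dest"]) == slot:
            del self.producer_slot[entry["dest"]]
        return entry["subscribers"]


iq = IssueQueue()
iq.dispatch(1, "r1", [])              # producer with no pending inputs
iq.dispatch(2, "r2", ["r1"])          # subscribes to slot 1
iq.dispatch(3, "r3", ["r1", "r2"])    # subscribes to slots 1 and 2
woken = iq.issue(1)                   # wakes the entries subscribed to slot 1
```

The point of the design is that wakeup is a direct lookup of at most two subscribers rather than a broadcast compared against every queue entry.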

  27. Register File • Register file is mirrored in the two cores • No extra copy instructions or load-balancing questions

  28. Execute • Add extra cycle for copying result to other core’s register file (like EV6)

  29. Memory Access • The two D$s are checked in parallel, each responsible for half of the merged D$’s ways • No standard LSQ, only a Memory Alias Table (details later) • Only detects ordering violations and sends a signal to the pipeline
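
The merged D$ lookup can be pictured as two set-associative caches that each supply half of the ways of one larger cache. The Python sketch below is illustrative only: the set count, ways per core, class names, and the random choice of victim way (borrowed from the Fetch slide's description of the merged I$) are assumptions.

```python
import random


class SetAssocCache:
    """One core's cache: `sets` sets of `ways` tags (hypothetical model)."""

    def __init__(self, sets, ways):
        self.sets = sets
        self.lines = [[None] * ways for _ in range(sets)]

    def lookup(self, block):
        return block // self.sets in self.lines[block % self.sets]

    def fill(self, block, way):
        self.lines[block % self.sets][way] = block // self.sets


class FederatedCache:
    """Two caches acting as one cache with twice the associativity."""

    def __init__(self, sets=4, ways_per_core=2, seed=0):
        self.ways_per_core = ways_per_core
        self.halves = [SetAssocCache(sets, ways_per_core) for _ in range(2)]
        self.rng = random.Random(seed)

    def access(self, block):
        """Return True on a hit; on a miss, fill one way and return False."""
        # Both halves are probed (in hardware this happens in parallel).
        if any(half.lookup(block) for half in self.halves):
            return True
        # Assumption: random replacement across all ways of the merged set.
        victim = self.rng.randrange(2 * self.ways_per_core)
        half = self.halves[victim // self.ways_per_core]
        half.fill(block, victim % self.ways_per_core)
        return False


cache = FederatedCache()
first = cache.access(5)     # miss: the block is placed in one of the two halves
second = cache.access(5)    # hit: found by probing both halves
```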

  30. Commit • Centralized commit, no slippage • Recovers from branch mispredictions at commit, since the RAT is not checkpointed on branches • Recovers from memory order violations (or false positives) reported by the MAT
