
Is SC + ILP = RC?


Presentation Transcript


  1. Is SC + ILP = RC? Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University Presented by Vamshi Kadaru Spring 2005: CS 7968 Parallel Computer Architecture

  2. Introduction • Availability of multiprocessors (how to maximize performance?) • Atomicity of operations (synchronization) • Allow in-order processors to overlap store latency with other work (e.g., bypassing loads, overlapping with network latency) • Allow processors to execute out of order (speculation) • There is a trade-off between programmability and performance • To simplify programming, implement a shared-memory abstraction

  3. Memory Models • Shared-memory systems implement memory consistency models • Different models make different guarantees; the processor can reorder/overlap memory operations as long as the guarantees are upheld • Sequential Consistency (SC) is the simplest model: it performs memory operations in program order • Relaxed memory models require only some memory operations to perform in program order • Release Consistency (RC) is the most relaxed of these models and offers the highest performance potential

  4. Current Memory Consistency Models • Sequential Consistency (SC) • HP and MIPS processors • Processor Consistency (PC) • Intel processors • Total Store Order (TSO) • Sun SPARC • Release Consistency (RC) • Sun SPARC, DEC Alpha, IBM PowerPC

  5. Current Optimizations • Techniques used to exploit ILP • Branch prediction • Execute multiple instructions per cycle • Non-blocking caches to overlap memory operations • Out-of-order execution • Implement precise exceptions and speculative execution • Reorder buffer

  6. Comparing SC and RC • Sequential Consistency (SC) • Guarantees memory order in hardware • Easier to program • Limits performance due to its conservative ordering • Release Consistency (RC) • Guarantees memory order through software annotations • Harder to program; more burden on the programmer • Achieves the highest performance because ordering is made explicit

  7. SC Implementations • Current SC implementations use ILP optimizations • Hardware prefetching and non-blocking caches to overlap loads and stores using the reorder buffer • Speculative load execution using the reorder buffer and a special history buffer to roll back in case of invalidation • Limitations • Inability of stores to bypass other memory operations • Long-latency remote stores fill up the relatively small reorder buffer and load/store queue, blocking the pipeline • Capacity and conflict misses in small L2 caches cause frequent rollbacks

  8. RC Implementations • RC allows a programmer to specify ordering constraints (fence instructions) among specific memory operations to enforce order • RC implementations use store buffering to allow loads and stores to bypass pending stores • Unlike SC, RC can use binding prefetches to perform loads in the reorder buffer • RC can also relax ordering around fence instructions and use rollback mechanisms if there is a memory model violation

  9. SC Programmability with RC Performance • SC can approach RC if the hardware provides support for: • SC to relax the order of loads and stores speculatively • Loads and stores to appear to take place atomically and in program order • Instructions to execute out of program order • Processor state to be remembered for rollbacks • Limitations (costs) • Speculative reordering alone gives no ordering guarantees; violations must be detected • Rollbacks must be infrequent (enough buffering space is needed)

  10. SC++ Architecture • Modeled after the MIPS R10000 • The SHiQ allows prefetching and non-blocking caches • Other processors still observe SC • The history buffer allows speculative retirement • Unblocks stores in the reorder buffer • The load/store queue takes stores from the reorder buffer • The BLT holds block addresses for the SHiQ

  11. Experimental Setup • Simulator: RSIM, on an 8-node DSM • Each DSM node is an R10000-like processor • Memory model implementations use • Non-blocking caches • Hardware prefetching for loads and stores • Speculative load execution • No speculative retirement is done in either SC or RC

  12. Base System Configuration • Each R10000-like processor node has the above configuration • A large L2 cache eliminates capacity and conflict misses • The base configuration is used unless otherwise specified

  13. Some Points to Remember • SC and RC implementations… • Use non-blocking caches • Use hardware prefetching for loads and stores • Perform speculative loads • SC++ uses… • Speculative History Queue (SHiQ) • Block Lookup Table (BLT) • Rollbacks due to instructions in the reorder buffer take one cycle • Rollbacks due to instructions in the SHiQ take four cycles

  14. Results – Base System • Speedup normalized to that of the SC implementation • RC outperforms SC • Largest gain for radix • SC++ performs better than or equal to RC • For raytrace it performs significantly better

  15. Results – Network Latency • Network latency increased by 4x • RC hides the network latency by overlapping stores • SC++inf keeps up with RC • raytrace benefits less because the longer network latency dominates its lock access patterns

  16. Results – Reorder Buffer Size • A larger reorder buffer allows more prefetch time • Speeds up both SC and RC • Hides store latencies by allowing more time for prefetches • In raytrace, no speedup for either SC or RC • Memory operations do not overlap much • In structured, the gap grows • Due to an increase in the number of rollbacks in SC

  17. Results – SHiQ Size & Speculative Stores • The absence of speculative stores causes significant performance loss • radix and raytrace • Reducing the SHiQ size degrades performance • em3d and radix

  18. Results – L2 Cache Size • Two effects of a smaller L2 cache • Less room for speculative state => the gap widens • Many load misses for both SC and RC => may narrow the performance gap • lu & radix – the high load miss rate degrades performance • SC++ is also sensitive to rollbacks due to replacements

  19. Conclusions • SC can perform as well as RC if the hardware provides enough support for speculation • SC++ allows speculative bypassing of both loads and stores • SC++ minimizes additional overhead on the processor pipeline's critical paths by using the following structures • SHiQ: stores speculative state and absorbs remote latencies • BLT: allows fast lookups into the SHiQ
