1 / 24

Presented by: Ashok Venkatesan

Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor. Chong-Liang Ooi, Seon Wook Kim, II Park, Rudolf Eigenmann, Babak Falsafi and T.N. Vijayakumar. Presented by: Ashok Venkatesan. Outline. Background Thread Level Parallelism(TLP)

xenos
Download Presentation

Presented by: Ashok Venkatesan

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multiplex: Unifying Conventional and Speculative Thread-Level Parallelism on a Chip Multiprocessor Chong-Liang Ooi, Seon Wook Kim, II Park, Rudolf Eigenmann, Babak Falsafi and T.N. Vijayakumar Presented by: Ashok Venkatesan

  2. Outline • Background • Thread Level Parallelism(TLP) • Explicit & Implicit TLP • An Example • Multiplex • Threading Model • MUCS protocol • Key Performance Factors • Performance Analysis • Conclusion

  3. Thread Level Parallelism • ILP Wall • Increasing CPI with increasing clock rates • Limited ILP in applications • Insufficient memory locality • Using TLP • Increased granularity of parallelism • Exploitation of Multi-cores • Threads: • A Logical sub-process that carries its own state. • State – Instructions, data, PC, register file, stack, etc.,

  4. Explicit & Implicit TLP • Explicit TLP • Program is explicitly partitioned into threads by programmer and an API is used to dispatch and execute on multiple cores. • Static – defined in the program • Main Overhead – Thread Dispatch • Implicit or Speculative TLP • Threads are peeled off from a sequential execution stream of the program by hardware prediction. • Dynamic – runtime prediction • Main Overhead – Speculative State Overflow

  5. Example – Exec Explicit Threads • Data Dependence is resolved using a barrier here • Dispatch of threads is done using a fork (System API) call

  6. Example – Exec Implicit Threads • Both data dependence as well as dispatch are handled by a hardware predictor

  7. Multiplex • Unifies explicit and implicit threading on a CMP • Obviates the need for serializing unanalyzable program segments by using speculative TLP • Avoids implicit threading’s speculation overhead and performance loss in compiler-analyzable program segments by using explicit threading. • Implements a single snoopy bus protocol to unify cache coherence with memory renaming and disambiguation.

  8. Anatomy of a Multiplex CMP

  9. Threading Model • Thread selection • Partitioning code into distinct instruction sequences. • Thread dispatch • Assigning threads to execute on different CPUs • Data communication and speculation • Propagating data between independent threads.

  10. Thread Selection in Multiplex • Methodology • Compiler chooses between threading models • Prioritizes explicit threading over implicit threading • Implicit threads selected by runtime speculation by hardware • However, software specifies implicit thread boundaries • Pros – Minimizes explicit and implicit overheads • Scenarios • Executing loops with small bodies implicitly • Executing tail ends of unevenly partitioned segments implicitly

  11. Thread Dispatch – An Overview • Dispatching conventional threads involve • Assigning PCs of CPUs the address of the first instruction of the thread • Assigning a private SP to CPUs • Copying stacks and register values prior to dispatch • Thread Descriptor – holds thread information • Stores the addresses of possible subsequent dispatch target threads • Holds register dependency information

  12. Thread Dispatch in Multiplex • Methodology • Predict subsequent threads based on current threads • Dispatch, execute and commit sequentially • Re-dispatch on squashing • Suspend dispatch upon mode switch to allow thread commits to complete • Instruction Set Changes - fork, stop and setsp • A Thread Predictor unit added to handle speculative prediction • A mode bit added to the Thread Descriptor • A TD Cache caches recently referenced descriptors

  13. MUCS Protocol • Mux Unified Coherence and Speculation - MUCS • Offers data coherence as well as versioning support • Key Design Objectives – minimize speculation overheads in two respects • Dependence resolution in the common case should be handled within the cache thereby minimizing bus transactions • Thread commit/squashes should only be done en masse and not as individual cache blocks.

  14. MUCS Protocol

  15. MUCS Protocol

  16. MUCS Protocol • 6 bits used for monitoring states of each cache block • Use – Set per speculative loadexecuted before store • Dirty – Set per speculative store in both modes • Commit – Set en masse on commit of speculative blocks • Stale – Set on a cache block when a newer version of data is available in another CPU • Squash – Set en masse on a cache touched by a squashed thread • Valid – Set per cache fill upon misses in both modes to determine validity of tag (not data)

  17. Key Performance Factors • Thread Size • Load Imbalance • Data Dependence • Thread dispatch/completion overhead • Speculative State Overflow

  18. Performance Analysis – System Info

  19. Performance Analysis – Best Case • Class 1 applications favor Implicit-only CMPs • Class 2 applications favor explicit-only CMPs • Avg Speedup of 4 dual issue CMP over one dual issue CMP • Implicit-only=1.14, Explicit-only=2.17, Multiplex = 2.3

  20. Performance Analysis - Overheads • I – implicit only, m - multiplex • Fpppp: provably parallel code = 0%, low squash buffer hits • wave5, tomcatv and swim have control flow irregularities in the inner loop i.e I/O stalls

  21. Performance Analysis – Cache Size • Effects of increasing cache size – performance increases • Multiplex incurs lesser overflow than implicit-only CMP • Effects of increasing data rates – performance decreases

  22. Conclusion • Coexistence of implicit and explicit multi-threading brings about a better speedup, showing a speedup of 2.63 during simulation • MUCS protocol allows such an implementation by mapping a coherence protocol needed for explicit threading to a subset of the states required for implicit threading and hence eliminates the need of extra hardware. • The dominant overheads for implicit and explicit threading are speculative state overflow and thread dispatching respectively.

  23. Questions?

  24. Thank you

More Related