1 / 36

EE 382N Guest Lecture Wish Branches

EE 382N Guest Lecture Wish Branches. Hyesoon Kim HPS Research Group The University of Texas at Austin. Lecture Outline. Predicated execution Wish branches 2D-profiling. Motivation. Branch predictors are still not perfect .

hieu
Download Presentation

EE 382N Guest Lecture Wish Branches

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. EE 382N Guest LectureWish Branches Hyesoon Kim HPS Research Group The University of Texas at Austin

  2. Lecture Outline • Predicated execution • Wish branches • 2D-profiling

  3. Motivation • Branch predictors are still not perfect. • Deeper pipeline andlarger instruction window increase the branch misprediction penalty. • Predicated execution can eliminate branch misprediction by converting control-dependency to data dependency. However, predicated code has overhead.

  4. (normal branch code) A A T N if (cond) { b = 0; } else { b = 1; } B C B C D D A p1 = (cond) branch p1, TARGET B mov b, 1 jmp JOIN C TARGET: mov b,0 Predicated Execution (predicated code) Convert control flow dependency to data dependency Pro: Eliminate hard-to-predict branches A p1 = (cond) (!p1) mov b,1 (p1) mov b,0 B C D add x, b, 1 Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved

  5. The Overhead of Predicated Execution -2% 16% 13% non-predicated p1 = (cond) (!p1) mov b,1 (p1) mov b,0 p1 = (cond) (0) mov b,1 (1)mov b,0 A B C D add x, b, 1 (Predicated code) If all overhead is ideally eliminated, predicated execution would provide 16% improvement in average execution time

  6. The Problem • Due to the predication overhead, predicated execution sometimes reduces performance • Branch misprediction characteristics are dependent on run-time behavior: input set, control-flow path andphase behavior. The compiler cannot accurately estimate the run-time behavior of branches

  7. A A T N B C B C D D Predicated Code Performance vs. Branch Misprediction Rate Predicated code performs better • Converting a branch to predicated code could hurt performance if run-time misprediction rate is lower than profile-time misprediction rate run-time (input B) profile-time (input A) X Normal branch code performs better • Execution time(normal branch code) = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction) • Execution time of predicated code = exec_pred

  8. Lecture Outline • Predicated execution • Wish branches • 2D-profiling

  9. Wish Branches [Kim et al. Micro-38] • A new type of control flow instruction 3 types: wish jump/join and wish loop • The compilergenerates code (with wish branches) that can be executed either as predicated code or non-predicated code (normal branch code) • The hardwaredecides to execute predicated code or normal branch code at run-time based on the confidence of branch prediction • Easy to predict: normal branch code • Hard to predict: predicated code

  10. A A T N B C B C D D A A p1 = (cond) (!p1) mov b,1 (p1) mov b,0 B B mov b, 1 jmp JOIN C C TARGET: mov b,0 normal branch code predicated code Wish Jump/Join High Confidence Low Confidence A wish jump nop B wish join Taken Not-Taken C D A p1=(cond) wish.jump p1 TARGET p1 = (cond) branch p1, TARGET B nop (!p1) mov b,1 wish.join !p1Join (1) mov b,1 wish.join (1)Join C TARGET: (1) mov b,0 TARGET: (p1) mov b,0 D JOIN: wish jump/join code

  11. do { a++; i++; } while (i<N); Wish Loop H X T X T N N High Confidence Low Confidence Y Y H mov p1, 1 LOOP: (p1) add a, a, 1 (p1) add i, i, 1 (p1) p1 = (cond) wish. loopp1, LOOP EXIT: X X LOOP: add a, a, 1 add i, i, 1 p1 = (i<N) branch p1, LOOP EXIT: (1) (1) (1) Y Y wish loop code normal backward branch code

  12. Mispredicted Case 1: Early-Exit Compared to normal branch code: predicate data dependency and one extra instruction(-) H X1 X2 X3 Y H Correct execution: T T N X T Early-exit: (Low confidence) Flush pipeline N H X1 X2 Y … T N Y X3 Y N

  13. Mispredicted Case 2: Late-Exit Compared to normal branch code: pro: reduce flush penalty (+++) cons: predicate data dependency and one extrainstruction(-) H Correct execution: X1 X2 X3 Y H T T N X T nop nop Late-exit: (Low confidence) N H X1 X2 X3 X4 X5 Y … T T T T N Y

  14. Mispredicted Cases3: No-Exit No-Exit: predicate data dependency and one extra instruction(-) H Correct execution: X1 X2 X3 Y H T T N nop nop Late-exit: X T H X1 X2 X3 X4 X5 Y … N T T T T N Flush pipeline Y No-exit: H X1 X2 X3 X4 X5 X6 … T T T T T Y

  15. Questions? • What kind of branches should be converted to wish branches (jump/join)? • Why not all branches? • What kind of branches should be converted to wish loops?

  16. Advantages/Disadvantages of Wish Branches • Advantages compared to predicated execution • Reduce the overhead of predication • Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code • Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (Wish loops) • Make predicated code less dependent on machine configuration (e.g. branch predictor)

  17. Advantages/Disadvantages of Wish Branches • Disadvantages compared to predicated execution • Extra branch instructions use machine resources • Extra branch instructions increase the contention for branch predictor table entries • May constrain the compiler’s scope for code optimizations

  18. Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support Instruction decode logic Predicate dependency elimination module Confidence estimator Front-end and branch misprediction detection/recovery module

  19. ISA Support • Using existing hint bits (IA-64, x86, PowerPC) • Hint bits can be ignored. A wish branch can be treated as a normal branch. OPCODE btypewtype target offset p btye: branch type (0:normal branch 1:wish branch) wtype: wish branch type (0:jump 1:loop 2:join) p: predicate register identifier

  20. Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support Instruction decode logic Predicate dependency elimination module Confidence estimator Front-end and branch misprediction detection/recovery module

  21. select candidates cost-benefit analysis wish jump conversion predicate selected blocks edge/value profiling if-conversion branch elimination wish join insertion wish loop conversion loop opt Compiler Support Major phase ordering with wish branch generation in code generation [ORC] region formation if-conversion loop opt (swp, unrolling) global inst. sched register allocation modified local inst. sched new existing

  22. Wish Branch Generation Algorithm • wish jump/join candidates: all branch which are suitable for if-conversion • The number of instructions in the fall-through block > N (N=5) : wish jump and join are inserted • All other branches converted to predicated code • A loop branch is converted into a wish loop: when the loop body has fewer than L instructions (L=30)

  23. Wish Branch Support • ISA Support • predicated execution, wish branch instruction • Compiler Support • Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches • Hardware Support • Instruction decode logic • Predicate dependency elimination module • Front-end and branch misprediction detection/recovery module • Confidence estimator

  24. Hardware Support • Instruction Fetch/decode logic Decoder: decode wish branches BTB: mark wish branches • Wish branch state machine hardware • Wish loop stays as low-confidence mode until the loop exits • Predicate dependency elimination module • High-confidence mode: predicate values are predicted • Branch misprediction detection/recovery module • No flush if wish branch is mispredicted during low-confidence mode • Confidence estimator

  25. > th? JRS Confidence Estimator Estimate how much confidence the processor has in a branch prediction Trained with branch misprediction information Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29] n bit Counters m bits PC + 2^m entries High Confidence Low Confidence Global BHR

  26. Experimental Infrastructure • IA-64 provides full support for predication • Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model Source Code IA-64 Binary IA-64 Trace µops IA-64 Compiler (ORC) Micro-op Translator Micro-op Simulator Trace generation module

  27. Simulation Methodology • Nine SPEC 2000 integer benchmarks • Baseline Processor Configuration • Front End • Large and accurate branch predictor(64KB hybrid branch predictor: gshare + local) • Minimum 30-cycle branch misprediction penalty • 64KB, 2-cycle latency I-cache • Execution Core • 8-wide out-of-order processor • 512-entry instruction window • Confidence Estimator • 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)

  28. Performance Improvement -4% 14% 2.02 8% 24% non-predicated 16% over conditional branch prediction (w/o mcf) 11% over selective-predication (w/o mcf) 7 % over aggressive predication (w/o mcf) 14% over conditional branch prediction and 13% over selective-predication and 16% over aggressive-predication 12% over conditional branch prediction 11% over selective-predication 13 % over aggressive predication SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated

  29. Wish Branch: Conclusion • New control flow instructions: wish branches (jump/join/loop) • Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture • Compiler: analyzes the control-flow graph and generates code • Microarchitecture: makes run-time decision to use predication • Wish branches provide significant performance benefits • 16% compared to conditional branch prediction • 13% compared to selectively predicated code • Wish branches can make predicated execution more viable and effective in high performance processors • By enablingadaptive and aggressive predicated execution

  30. Lecture Outline • Predicated execution • Wish branches • 2D-profiling

  31. 2D-profiling • Goal: Identify input-dependent branches by using a single input set for profiling • If We Know a Branch is Input-Dependent • May not convert it to predicated code. • May convert it to a wish branch. • May not perform other compiler optimizations or may perform them less aggressively. • Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]

  32. input-dependent input-independent Key Insight of 2D-profiling Phase behavior in prediction accuracy is a good indicator of input dependence phase 2 phase 3 phase 1

  33. brA time brB time Traditional Profiling pr. Acc MEAN pr.Acc(brA) pr. Acc MEAN pr.Acc(brB) MEAN pr.Acc(brA)  MEAN pr.Acc(brB) behavior of brA  behavior of brB

  34. brA time brB time 2D-profiling pr. Acc MEAN pr.Acc(brA) STD pr.Acc(brA) pr. Acc MEAN pr.Acc(brB) STD pr.Acc(brB) MEAN pr.Acc(brA)  MEAN pr.Acc(brB) STD pr.Acc(brA) ≠ STD pr.Acc(brB) behavior of brA ≠ behavior of brB A: input-dependent br, B: input-independent br

  35. 2D-profiling Mechanism • The profiler collects branch prediction accuracy information for every static branchover time slice size = M instructions Slice 1 Slice 2 … Slice N time mean Pr.Acc(brA,s1) mean Pr.Acc(brA,s2) ... mean Pr.Acc(brA,sN) mean Pr.Acc(brB,s1) mean Pr.Acc(brB,s2) ... mean Pr.Acc(brB,sN) . . . . . . . . . PAM:50% brA mean brA Calculate MEAN (brA, brB, …), Standard deviation (brA, brB, …), PAM:Points Above Mean (brA, brB, …) brB PAM:0% mean brB

  36. 2D-profiling: Conclusion & Future Work • 2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling • 2D-profiling uses time-varying information instead of just average data • Phase behavior in prediction accuracy in a profile run  input-dependent • Future Work: • Better predicated code/wish branch generation algorithms • Detecting other input-dependent program characteristics

More Related