
ILP (Recap)


kasia

Presentation Transcript


  1. ILP (Recap)

  2. Instruction Level Parallelism
     • ILP within a basic block (BB) is quite small.
     • BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
     • Average dynamic branch frequency is 15% to 25%, so only 4 to 7 instructions execute between a pair of branches.
     • In addition, the instructions in a BB are likely to depend on each other.
     • To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
     • Simplest approach: loop-level parallelism, which exploits parallelism among the iterations of a loop.
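The "4 to 7 instructions" figure on slide 2 appears to be the reciprocal of the branch frequency, counting the branch itself. A minimal sketch of that arithmetic follows; the function name `instructions_per_branch` is hypothetical and not part of the slides.

```c
#include <assert.h>

/* Average number of dynamic instructions per branch (branch included),
   given the fraction of executed instructions that are branches.
   Hypothetical helper for illustration only. */
double instructions_per_branch(double branch_fraction) {
    assert(branch_fraction > 0.0 && branch_fraction <= 1.0);
    return 1.0 / branch_fraction;
}
```

With a branch fraction of 0.25 this gives 4, and with 0.15 it gives roughly 6.7, which rounds to the slide's upper bound of 7.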

  3. Loop Unrolling Example: Key to increasing ILP
     • For the loop:
         for (i=1; i<=1000; i++) x[i] = x[i] + s;
     • Latencies assumed between dependent instructions:
         Instruction producing result   Instruction using result   Latency in clock cycles
         FP ALU op                      Another FP ALU op          3
         FP ALU op                      Store double               2
         Load double                    FP ALU op                  1
         Load double                    Store double               0
         Integer op                     Integer op                 0
     • The straightforward MIPS assembly code is given by:
         Loop: L.D    F0, 0(R1)     ; F0 = array element
               ADD.D  F4, F0, F2    ; add the scalar held in F2
               S.D    F4, 0(R1)     ; store the result
               SUBI   R1, R1, #8    ; decrement pointer by 8 bytes (one double)
               BNEZ   R1, Loop      ; branch if R1 != 0

  4. Loop Showing Stalls and Code Re-arrangement
     Without scheduling:
         1 Loop: LD    F0,0(R1)
         2       stall
         3       ADDD  F4,F0,F2
         4       stall
         5       stall
         6       SD    0(R1),F4
         7       SUBI  R1,R1,8
         8       BNEZ  R1,Loop
         9       stall
     9 clock cycles per loop iteration.
     After code re-arrangement:
         1 Loop: LD    F0,0(R1)
         2       stall
         3       ADDD  F4,F0,F2
         4       SUBI  R1,R1,8
         5       BNEZ  R1,Loop
         6       SD    8(R1),F4     ; offset is 8 because SUBI has already decremented R1
     Code now takes 6 clock cycles per loop iteration. Speedup = 9/6 = 1.5
     • The number of cycles cannot be reduced further because:
         • The body of the loop is small
         • The loop overhead (SUBI R1, R1, 8 and BNEZ R1, Loop) is paid on every iteration
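The cycle counts on slide 4 can be reproduced with a small in-order issue model built on the latency table from slide 3. This is only a sketch under simplifying assumptions that the slides do not state: one instruction issues per cycle, each instruction has at most one tracked producer, and an unfilled branch delay slot costs one stall. The names `Kind`, `Instr`, `latency`, and `loop_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that appear in the slide's latency table. */
enum Kind { LOAD, FP_ALU, STORE, INT_OP, BRANCH };

struct Instr {
    enum Kind kind;
    int dep;  /* index of the earlier instruction whose result is used, or -1 */
};

/* Stall cycles needed between a producer and a dependent consumer. */
static int latency(enum Kind producer, enum Kind consumer) {
    if (producer == FP_ALU && consumer == FP_ALU) return 3;
    if (producer == FP_ALU && consumer == STORE)  return 2;
    if (producer == LOAD   && consumer == FP_ALU) return 1;
    return 0;  /* load -> store, integer -> integer, and all other pairs */
}

/* Cycles for one pass through the loop body. If delay_slot_filled is zero,
   one extra stall is charged after the final branch. */
int loop_cycles(const struct Instr *code, size_t n, int delay_slot_filled) {
    int issue[64];
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;  /* in-order issue */
        int d = code[i].dep;
        if (d >= 0) {
            assert((size_t)d < i);
            int ready = issue[d] + latency(code[d].kind, code[i].kind) + 1;
            if (ready > cycle) cycle = ready;         /* wait for the operand */
        }
        issue[i] = cycle;
    }
    return issue[n - 1] + (delay_slot_filled ? 0 : 1);
}
```

Encoding the unscheduled loop as LD, ADDD, SD, SUBI, BNEZ gives 9 cycles, and the re-arranged order LD, ADDD, SUBI, BNEZ, SD with the delay slot filled gives 6. Under the same assumptions, the two listings on slide 6 come out at 27 and 14 cycles.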

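  5. Basic Loop Unrolling
     • Concept: [Figure: a loop of 4n iterations is replaced by a loop of n iterations, each of which covers 4 of the original iterations. Labels: "4n iterations", "n iterations", "4 iterations".]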

  6. Unroll Loop Four Times to expose more ILP and reduce loop overhead
     Unrolled, not scheduled:
          1 Loop: LD    F0,0(R1)
          2       ADDD  F4,F0,F2
          3       SD    0(R1),F4      ; drop SUBI & BNEZ
          4       LD    F6,-8(R1)
          5       ADDD  F8,F6,F2
          6       SD    -8(R1),F8     ; drop SUBI & BNEZ
          7       LD    F10,-16(R1)
          8       ADDD  F12,F10,F2
          9       SD    -16(R1),F12   ; drop SUBI & BNEZ
         10       LD    F14,-24(R1)
         11       ADDD  F16,F14,F2
         12       SD    -24(R1),F16
         13       SUBI  R1,R1,#32
         14       BNEZ  R1,Loop
         15       stall
     15 + 4 x (2 + 1) = 27 clock cycles, or 6.8 cycles per iteration (2 stalls after each ADDD and 1 stall after each LD)
     Unrolled and scheduled:
          1 Loop: LD    F0,0(R1)
          2       LD    F6,-8(R1)
          3       LD    F10,-16(R1)
          4       LD    F14,-24(R1)
          5       ADDD  F4,F0,F2
          6       ADDD  F8,F6,F2
          7       ADDD  F12,F10,F2
          8       ADDD  F16,F14,F2
          9       SD    0(R1),F4
         10       SD    -8(R1),F8
         11       SD    -16(R1),F12
         12       SUBI  R1,R1,#32
         13       BNEZ  R1,Loop
         14       SD    8(R1),F16     ; 8 - 32 = -24
     14 clock cycles, or 3.5 clock cycles per iteration
     • The compiler (or hardware) must be able to:
         • Determine data dependences
         • Re-arrange code
         • Rename registers
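The same transformation can be sketched at the source level. The version below mirrors the scheduled listing on slide 6: four loads, four adds, then four stores, with a separate temporary per copy standing in for register renaming. It is an illustration only; it uses 0-based ascending indexing instead of the slide's descending pointer, assumes the element count is a multiple of 4, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one element per iteration, as in the slide's source loop. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by four: one loop test and one index update per four elements. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);  /* assumption: trip count is a multiple of 4 */
    for (size_t i = 0; i < n; i += 4) {
        /* four independent loads (distinct temporaries, like F0, F6, F10, F14) */
        double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
        /* four independent adds (like F4, F8, F12, F16) */
        double r0 = t0 + s, r1 = t1 + s, r2 = t2 + s, r3 = t3 + s;
        /* four stores */
        x[i] = r0;
        x[i + 1] = r1;
        x[i + 2] = r2;
        x[i + 3] = r3;
    }
}
```

Both functions produce the same array contents. An optimizing compiler may perform this unrolling on its own, so the hand-written form is mainly useful for seeing what the transformation does.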

  7. Loop-Level Parallelism (LLP) Analysis
     • LLP analysis focuses on whether data accesses in later iterations of a loop are data dependent on data values produced in earlier iterations.
     • e.g. in
         for (i=1; i<=1000; i++) x[i] = x[i] + s;
       the computation in each iteration is independent of the previous iterations, so the loop iterations are parallel (independent of each other). x[i] is used twice, but within a single iteration.
     • Loop-carried dependence: a data dependence between different loop iterations (data produced in an earlier iteration is used in a later one); it limits parallelism.
     • Instruction level parallelism (ILP) analysis, on the other hand, is usually done when instructions are generated by the compiler.

  8. LLP Analysis Example 1
     • In the loop:
         for (i=1; i<=100; i=i+1) {
             A[i+1] = A[i] + C[i];      /* S1 */
             B[i+1] = B[i] + A[i+1];    /* S2 */
         }
       (where A, B, C are distinct non-overlapping arrays)
     • S2 uses the value A[i+1], computed by S1 in the same iteration. This data dependence is within the same iteration (not a loop-carried dependence) and does not prevent loop iteration parallelism.
     • S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1 (loop-carried dependence, prevents parallelism). The same applies to S2 for B[i] and B[i+1].
     • These two loop-carried dependences span more than one iteration and prevent loop parallelism.
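One way to see the loop-carried dependence in Example 1 is to run the iterations in a different order and observe that the results change. The sketch below is an illustration, not part of the slides: the function name `example1`, the `ascending` flag, and the trip count parameter `n` are hypothetical. Arrays are 1-indexed as on the slide, so A and B need n + 2 elements.

```c
#include <assert.h>

/* Executes the loop body of Example 1 for iterations 1..n, in ascending
   order if ascending is nonzero and in descending order otherwise. */
void example1(double *A, double *B, const double *C, int n, int ascending) {
    for (int k = 1; k <= n; k++) {
        int i = ascending ? k : n + 1 - k;
        A[i + 1] = A[i] + C[i];      /* S1: reads A[i] written by iteration i-1 */
        B[i + 1] = B[i] + A[i + 1];  /* S2: reads B[i] written by iteration i-1 */
    }
}
```

If the iterations were independent, both orders would leave identical values in A and B. They do not, because each iteration consumes values that the previous one produced.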

  9. LLP Analysis Example 2
     • In the loop:
         for (i=1; i<=100; i=i+1) {
             A[i] = A[i] + B[i];        /* S1 */
             B[i+1] = C[i] + D[i];      /* S2 */
         }
     • S1 uses the value B[i] computed by S2 in the previous iteration (loop-carried dependence).
     • This dependence is not circular: S1 depends on S2 but S2 does not depend on S1.
     • Can be made parallel by replacing the code with the following:
         A[1] = A[1] + B[1];                  /* loop start-up code */
         for (i=1; i<=99; i=i+1) {
             B[i+1] = C[i] + D[i];
             A[i+1] = A[i+1] + B[i+1];
         }
         B[101] = C[100] + D[100];            /* loop completion code */

  10. LLP Analysis Example 2
      Original loop:
          for (i=1; i<=100; i=i+1) {
              A[i] = A[i] + B[i];        /* S1 */
              B[i+1] = C[i] + D[i];      /* S2 */
          }
          Iteration 1:    A[1] = A[1] + B[1];         B[2] = C[1] + D[1];
          Iteration 2:    A[2] = A[2] + B[2];         B[3] = C[2] + D[2];
          . . .
          Iteration 99:   A[99] = A[99] + B[99];      B[100] = C[99] + D[99];
          Iteration 100:  A[100] = A[100] + B[100];   B[101] = C[100] + D[100];
        Loop-carried dependence: B[i+1] is written in iteration i and read in iteration i+1.
      Modified parallel loop:
          A[1] = A[1] + B[1];
          for (i=1; i<=99; i=i+1) {
              B[i+1] = C[i] + D[i];
              A[i+1] = A[i+1] + B[i+1];
          }
          B[101] = C[100] + D[100];
          Loop start-up code:    A[1] = A[1] + B[1];
          Iteration 1:           B[2] = C[1] + D[1];       A[2] = A[2] + B[2];
          . . .
          Iteration 98:          B[99] = C[98] + D[98];    A[99] = A[99] + B[99];
          Iteration 99:          B[100] = C[99] + D[99];   A[100] = A[100] + B[100];
          Loop completion code:  B[101] = C[100] + D[100];
        Not a loop-carried dependence: B[i+1] is now written and read within the same iteration.
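The equivalence of the original and modified loops in Example 2 can be checked directly. The sketch below transcribes both forms into C functions; the names `example2_original` and `example2_transformed` and the trip count parameter `n` are hypothetical, and arrays are 1-indexed as on the slides, so B needs n + 2 elements.

```c
#include <assert.h>

/* Original loop: S1 in iteration i reads B[i], written by S2 in iteration i-1. */
void example2_original(double *A, double *B, const double *C,
                       const double *D, int n) {
    for (int i = 1; i <= n; i++) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed loop: every B value read inside the loop body is written
   earlier in the same iteration, so no dependence crosses iterations. */
void example2_transformed(double *A, double *B, const double *C,
                          const double *D, int n) {
    assert(n >= 1);
    A[1] = A[1] + B[1];                  /* loop start-up code */
    for (int i = 1; i <= n - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[n + 1] = C[n] + D[n];              /* loop completion code */
}
```

Running both on identical inputs with n = 100 should leave identical contents in A and B, which matches the regrouping of statements shown on slide 10.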
