
Loop Mangling




  1. Loop Mangling
  With both thanks and apologies to Allen and Kennedy

  2. Preliminaries
  • We want to parallelize programs, so we concentrate on loops
  • Types of parallelism:
    • Vectorization
    • Instruction-level parallelism
    • Coarser-grain parallelism
      • Shared memory
      • Distributed memory
      • Message passing
  • Formalism based upon dependence analysis

  3. Transformations
  • We call a transformation safe if the transformed program has the same "meaning" as the original program
  • But what is the "meaning" of a program? For our purposes:
  • Two computations are equivalent if, on the same inputs, they produce the same outputs in the same order

  4. Reordering Transformations
  • A reordering transformation is any program transformation that merely changes the order of execution of the code, without adding or deleting any executions of any statements (see the sketch below)
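  For instance (a hypothetical two-statement loop, not from the deck): swapping S1 and S2 below is a reordering transformation, since every statement instance still executes, just in a different order. Here it is also safe, because S1 and S2 touch no common data.

       DO I = 1, N
          S1  A(I) = B(I) + C
          S2  D(I) = E(I) * 2.0
       ENDDO

  reordered to:

       DO I = 1, N
          S2  D(I) = E(I) * 2.0
          S1  A(I) = B(I) + C
       ENDDO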

  5. Properties of Reordering Transformations
  • A reordering transformation does not eliminate dependences
  • However, it can reverse the relative order of the source and sink of a dependence, which can lead to incorrect behavior
  • A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence.

  6. Fundamental Theorem of Dependence
  • Any reordering transformation that preserves every dependence in a program preserves the meaning of that program
  • Proof is by contradiction; see Theorem 2.2 in the Allen and Kennedy text

  7. Fundamental Theorem of Dependence
  • A transformation is said to be valid for the program to which it applies if it preserves all dependences in the program.

  8. Parallelization and Vectorization
  • It is valid to convert a sequential loop to a parallel loop if the loop carries no dependence.
  • We want to convert loops like:
       DO I = 1, N
          X(I) = X(I) + C
       ENDDO
    to:
       X(1:N) = X(1:N) + C
    (Fortran 77 to Fortran 90)
  • However:
       DO I = 1, N
          X(I+1) = X(I) + C
       ENDDO
    is not equivalent to:
       X(2:N+1) = X(1:N) + C
    because the array assignment reads all of X(1:N) before any writes, while the loop reads values written on earlier iterations.

  9. Loop Distribution
  • Can statements in loops which carry dependences be vectorized?
       DO I = 1, N
          S1  A(I+1) = B(I) + C
          S2  D(I) = A(I) + E
       ENDDO
  • Dependence: S1 δ1 S2 (a true dependence carried by the loop)
  • This can be converted to:
       S1  A(2:N+1) = B(1:N) + C
       S2  D(1:N) = A(1:N) + E

  10. Loop Distribution
       DO I = 1, N
          S1  A(I+1) = B(I) + C
          S2  D(I) = A(I) + E
       ENDDO
  • transformed to:
       DO I = 1, N
          S1  A(I+1) = B(I) + C
       ENDDO
       DO I = 1, N
          S2  D(I) = A(I) + E
       ENDDO
  • which leads to:
       S1  A(2:N+1) = B(1:N) + C
       S2  D(1:N) = A(1:N) + E

  11. Loop Distribution
  • Loop distribution fails if there is a cycle of dependences
       DO I = 1, N
          S1  A(I+1) = B(I) + C
          S2  B(I+1) = A(I) + E
       ENDDO
    S1 δ1 S2 and S2 δ1 S1
  • What about:
       DO I = 1, N
          S1  B(I) = A(I) + E
          S2  A(I+1) = B(I) + C
       ENDDO
    (Here, too, there is a cycle: S1 δ S2 is loop-independent and S2 δ1 S1 is carried, so distribution still fails.)

  12. Fine-Grained Parallelism
  Techniques to enhance fine-grained parallelism:
  • Loop Interchange
  • Scalar Expansion
  • Scalar Renaming
  • Array Renaming

  13. Motivational Example
       DO J = 1, M
          DO I = 1, N
             T = 0.0
             DO K = 1, L
                T = T + A(I,K) * B(K,J)
             ENDDO
             C(I,J) = T
          ENDDO
       ENDDO
  However, by scalar expansion, we can get:
       DO J = 1, M
          DO I = 1, N
             T$(I) = 0.0
             DO K = 1, L
                T$(I) = T$(I) + A(I,K) * B(K,J)
             ENDDO
             C(I,J) = T$(I)
          ENDDO
       ENDDO

  14. Motivational Example
       DO J = 1, M
          DO I = 1, N
             T$(I) = 0.0
             DO K = 1, L
                T$(I) = T$(I) + A(I,K) * B(K,J)
             ENDDO
             C(I,J) = T$(I)
          ENDDO
       ENDDO

  15. Motivational Example II
  • Loop distribution gives us:
       DO J = 1, M
          DO I = 1, N
             T$(I) = 0.0
          ENDDO
          DO I = 1, N
             DO K = 1, L
                T$(I) = T$(I) + A(I,K) * B(K,J)
             ENDDO
          ENDDO
          DO I = 1, N
             C(I,J) = T$(I)
          ENDDO
       ENDDO

  16. Motivational Example III
  Finally, interchanging the I and K loops and vectorizing over I, we get:
       DO J = 1, M
          T$(1:N) = 0.0
          DO K = 1, L
             T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
          ENDDO
          C(1:N,J) = T$(1:N)
       ENDDO
  • A couple of new transformations used:
    • Loop interchange
    • Scalar expansion

  17. Loop Interchange
       DO I = 1, N
          DO J = 1, M
             S  A(I,J+1) = A(I,J) + B      ! DV: (=, <)
          ENDDO
       ENDDO
  • Applying loop interchange:
       DO J = 1, M
          DO I = 1, N
             S  A(I,J+1) = A(I,J) + B      ! DV: (<, =)
          ENDDO
       ENDDO
  • leads to:
       DO J = 1, M
          S  A(1:N,J+1) = A(1:N,J) + B
       ENDDO

  18. Scalar Expansion
       DO I = 1, N
          S1  T = A(I)
          S2  A(I) = B(I)
          S3  B(I) = T
       ENDDO
  • Scalar expansion:
       DO I = 1, N
          S1  T$(I) = A(I)
          S2  A(I) = B(I)
          S3  B(I) = T$(I)
       ENDDO
       T = T$(N)
  • leads to:
       S1  T$(1:N) = A(1:N)
       S2  A(1:N) = B(1:N)
       S3  B(1:N) = T$(1:N)
           T = T$(N)

  19. Scalar Expansion
  • However, scalar expansion is not always profitable. Consider:
       DO I = 1, N
          T = T + A(I) + A(I+1)
          A(I) = T
       ENDDO
  • Scalar expansion gives us:
       T$(0) = T
       DO I = 1, N
          S1  T$(I) = T$(I-1) + A(I) + A(I+1)
          S2  A(I) = T$(I)
       ENDDO
       T = T$(N)
  • The expanded loop still carries the recurrence (T$(I) depends on T$(I-1)), so nothing has been gained.

  20. Scalar Renaming
       DO I = 1, 100
          S1  T = A(I) + B(I)
          S2  C(I) = T + T
          S3  T = D(I) - B(I)
          S4  A(I+1) = T * T
       ENDDO
  • Renaming scalar T:
       DO I = 1, 100
          S1  T1 = A(I) + B(I)
          S2  C(I) = T1 + T1
          S3  T2 = D(I) - B(I)
          S4  A(I+1) = T2 * T2
       ENDDO

  21. Scalar Renaming
  • will lead to (after scalar expansion and statement reordering):
       S3  T2$(1:100) = D(1:100) - B(1:100)
       S4  A(2:101) = T2$(1:100) * T2$(1:100)
       S1  T1$(1:100) = A(1:100) + B(1:100)
       S2  C(1:100) = T1$(1:100) + T1$(1:100)
           T = T2$(100)

  22. Array Renaming
       DO I = 1, N
          S1  A(I) = A(I-1) + X
          S2  Y(I) = A(I) + Z
          S3  A(I) = B(I) + C
       ENDDO
  • Dependences: S1 δ S2 (true), S2 δ-1 S3 (anti), S3 δ1 S1 (true, carried by the loop), S1 δ0 S3 (output)
  • Rename A(I) to A$(I):
       DO I = 1, N
          S1  A$(I) = A(I-1) + X
          S2  Y(I) = A$(I) + Z
          S3  A(I) = B(I) + C
       ENDDO
  • Dependences remaining: S1 δ S2 and S3 δ1 S1

  23. Seen So Far...
  • Uncovering potential vectorization in loops by:
    • Loop interchange
    • Scalar expansion
    • Scalar and array renaming
  • Safety and profitability of these transformations

  24. And Now ...
  • More transformations:
    • Node splitting
    • Recognition of reductions
    • Index-set splitting
    • Run-time symbolic resolution
  • Unified framework to generate vector code
    • Left to the implementer (get a copy of the Allen and Kennedy book)

  25. Node Splitting
  • Sometimes renaming fails:
       DO I = 1, N
          S1:  A(I) = X(I+1) + X(I)
          S2:  X(I+1) = B(I) + 32
       ENDDO
  • The recurrence is kept intact by the renaming algorithm

  26. Node Splitting
       DO I = 1, N
          S1:  A(I) = X(I+1) + X(I)
          S2:  X(I+1) = B(I) + 32
       ENDDO
  • Break the critical antidependence: make a copy of the node from which the antidependence emanates
       DO I = 1, N
          S1': X$(I) = X(I+1)
          S1:  A(I) = X$(I) + X(I)
          S2:  X(I+1) = B(I) + 32
       ENDDO
  • The recurrence is broken; this is vectorized to:
       X$(1:N) = X(2:N+1)
       X(2:N+1) = B(1:N) + 32
       A(1:N) = X$(1:N) + X(1:N)

  27. Node Splitting
  • Determining the minimal set of critical antidependences is NP-complete, so a perfect job of node splitting is difficult
  • Heuristic:
    • Select an antidependence
    • Delete it and check whether the graph becomes acyclic
    • If acyclic, apply node splitting

  28. Recognition of Reductions
  • Sum reduction, min/max reduction, count reduction
  • Vector ---> single element
       S = 0.0
       DO I = 1, N
          S = S + A(I)
       ENDDO
  • Not directly vectorizable: S = S + A(I) carries a true dependence on S

  29. Recognition of Reductions
  • Assuming commutativity and associativity:
       S = 0.0
       DO k = 1, 4
          SUM(k) = 0.0
          DO I = k, N, 4
             SUM(k) = SUM(k) + A(I)
          ENDDO
          S = S + SUM(k)
       ENDDO
  • Distribute the k loop:
       S = 0.0
       DO k = 1, 4
          SUM(k) = 0.0
       ENDDO
       DO k = 1, 4
          DO I = k, N, 4
             SUM(k) = SUM(k) + A(I)
          ENDDO
       ENDDO
       DO k = 1, 4
          S = S + SUM(k)
       ENDDO

  30. Recognition of Reductions
  • After loop interchange:
       DO I = 1, N, 4
          DO k = I, min(I+3,N)
             SUM(k-I+1) = SUM(k-I+1) + A(k)
          ENDDO
       ENDDO
  • Vectorize:
       DO I = 1, N, 4
          SUM(1:4) = SUM(1:4) + A(I:I+3)
       ENDDO

  31. Recognition of Reductions
  • Properties of reductions:
    • Reduce a vector/array to one element
    • No use of intermediate values elsewhere in the loop
    • The reduction operates on the vector and nothing else
  • (A contrasting non-reduction is sketched below.)
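  For contrast, a minimal sketch (echoing the slide 19 example, not from the deck): a loop that stores each running sum uses intermediate values, so it is not a pure reduction and this rewrite does not apply.

       S = 0.0
       DO I = 1, N
          S = S + A(I)
          B(I) = S        ! intermediate value of S escapes: not a reduction
       ENDDO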

  32. Index-set Splitting
  • Subdivide a loop into different iteration ranges to achieve partial parallelization
  • Threshold analysis [strong SIV, weak crossing SIV]
  • Loop peeling [weak zero SIV]
  • Section-based splitting [a variation of loop peeling]

  33. Threshold Analysis
       DO I = 1, 20
          A(I+20) = A(I) + B
       ENDDO
  • vectorizes to:
       A(21:40) = A(1:20) + B
  • But:
       DO I = 1, 100
          A(I+20) = A(I) + B
       ENDDO
  • strip mine to:
       DO I = 1, 100, 20
          DO i = I, I+19
             A(i+20) = A(i) + B
          ENDDO
       ENDDO
  • then vectorize the inner loop (see below)
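  The final step, not spelled out on the slide: the dependence distance is 20, so each strip of 20 iterations reads and writes disjoint array sections and the inner loop vectorizes.

       DO I = 1, 100, 20
          A(I+20:I+39) = A(I:I+19) + B
       ENDDO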

  34. Loop Peeling
  • Used when the source of a dependence is a single iteration:
       DO I = 1, N
          A(I) = A(I) + A(1)
       ENDDO
  • Peel off the first iteration:
       A(1) = A(1) + A(1)
       DO I = 2, N
          A(I) = A(I) + A(1)
       ENDDO
  • Vectorize to:
       A(1) = A(1) + A(1)
       A(2:N) = A(2:N) + A(1)

  35. Run-time Symbolic Resolution
  • "Breaking conditions":
       DO I = 1, N
          A(I+L) = A(I) + B(I)
       ENDDO
  • Transformed to:
       IF (L.LE.0) THEN
          A(L+1:N+L) = A(1:N) + B(1:N)
       ELSE
          DO I = 1, N
             A(I+L) = A(I) + B(I)
          ENDDO
       ENDIF
  • When L ≤ 0 the loop carries no true dependence (at most an antidependence), and the array assignment evaluates its right-hand side before storing, so the vector form is safe

  36. Run-time Symbolic Resolution
  • Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete
  • Heuristic: identify when a critical dependence can be conditionally eliminated via a breaking condition

  37. Putting It All Together
  • Good part:
    • Many transformations imply more choices to exploit parallelism
  • Bad part:
    • Choosing the right transformation
    • How to automate the transformation selection process?
    • Interference between transformations

  38. Putting It All Together
  • Any algorithm that ties all these transformations together must:
    • Take a global view of the transformed code
    • Know the architecture of the target machine
  • Goal of our algorithm: find ONE good vector loop [works well for most vector register architectures]

  39. Compiler Improvement of Register Usage

  40. Overview
  • Improving memory hierarchy performance by compiler transformations:
    • Scalar replacement
    • Unroll-and-jam
  • Saving memory loads and stores
  • Making good use of the processor registers

  41. Motivating Example
       DO I = 1, N
          DO J = 1, M
             A(I) = A(I) + B(J)
          ENDDO
       ENDDO
  • A(I) can be left in a register throughout the inner loop, but coloring-based register allocation fails to recognize this
       DO I = 1, N
          T = A(I)
          DO J = 1, M
             T = T + B(J)
          ENDDO
          A(I) = T
       ENDDO
  • All loads and stores to A in the inner loop have been saved
  • High chance of T being allocated a register by the coloring algorithm

  42. Scalar Replacement
  • Convert array references to scalar references to improve the performance of the coloring-based allocator
  • Our approach is to use dependences to achieve these memory hierarchy transformations

  43. Unroll-and-Jam
       DO I = 1, N*2
          DO J = 1, M
             A(I) = A(I) + B(J)
          ENDDO
       ENDDO
  • Can we achieve reuse of references to B? Use a transformation called unroll-and-jam
       DO I = 1, N*2, 2
          DO J = 1, M
             A(I) = A(I) + B(J)
             A(I+1) = A(I+1) + B(J)
          ENDDO
       ENDDO
  • Unroll the outer loop twice and then fuse the copies of the inner loop
  • This brings two uses of B(J) together

  44. Unroll-and-Jam
       DO I = 1, N*2, 2
          DO J = 1, M
             A(I) = A(I) + B(J)
             A(I+1) = A(I+1) + B(J)
          ENDDO
       ENDDO
  • Apply scalar replacement to this code:
       DO I = 1, N*2, 2
          s0 = A(I)
          s1 = A(I+1)
          DO J = 1, M
             t = B(J)
             s0 = s0 + t
             s1 = s1 + t
          ENDDO
          A(I) = s0
          A(I+1) = s1
       ENDDO
  • Half the number of loads of the original program

  45. Legality of Unroll-and-Jam
  • Is unroll-and-jam always legal?
       DO I = 1, N*2
          DO J = 1, M
             A(I+1,J-1) = A(I,J) + B(I,J)
          ENDDO
       ENDDO
  • Apply unroll-and-jam:
       DO I = 1, N*2, 2
          DO J = 1, M
             A(I+1,J-1) = A(I,J) + B(I,J)
             A(I+2,J-1) = A(I+1,J) + B(I+1,J)
          ENDDO
       ENDDO
  • This is wrong!!!

  46. Legality of Unroll-and-Jam
  • The direction vector in this example was (<,>), which makes loop interchange illegal
  • Unroll-and-jam is loop interchange, followed by unrolling the inner loop, followed by another loop interchange (see the sketch below)
  • But does illegal loop interchange imply illegal unroll-and-jam? NO
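  A sketch of that decomposition applied to the slide 43 loop (where the interchange happens to be legal):

  • Step 1, interchange:
       DO J = 1, M
          DO I = 1, N*2
             A(I) = A(I) + B(J)
          ENDDO
       ENDDO
  • Step 2, unroll the now-inner I loop:
       DO J = 1, M
          DO I = 1, N*2, 2
             A(I) = A(I) + B(J)
             A(I+1) = A(I+1) + B(J)
          ENDDO
       ENDDO
  • Step 3, interchange back:
       DO I = 1, N*2, 2
          DO J = 1, M
             A(I) = A(I) + B(J)
             A(I+1) = A(I+1) + B(J)
          ENDDO
       ENDDO
  This reproduces exactly the unroll-and-jam result of slide 43.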

  47. Conditions for legality of unroll-and-jam
  • Definition: unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together.
  • Theorem: an unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<,>) such that the distance for the outer loop is less than n.
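  For instance (a hypothetical variant of the slide 45 loop, not from the deck): the dependence below still has direction vector (<,>), so loop interchange is illegal, but its outer-loop distance is 2, which is not less than n = 2, so unroll-and-jam to factor 2 is legal.

       DO I = 1, N*2
          DO J = 1, M
             A(I+2,J-1) = A(I,J) + B(I,J)
          ENDDO
       ENDDO
  becomes:
       DO I = 1, N*2, 2
          DO J = 1, M
             A(I+2,J-1) = A(I,J) + B(I,J)
             A(I+3,J-1) = A(I+1,J) + B(I+1,J)
          ENDDO
       ENDDO
  The value written by the first statement is not read until a later iteration of the jammed outer loop, so no dependence is reversed.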

  48. Conclusion
  • We have learned two memory hierarchy transformations:
    • Scalar replacement
    • Unroll-and-jam
  • They reduce the number of memory accesses by making maximum use of the processor registers

  49. Loop Fusion
  • Consider the following:
       DO I = 1, N
          A(I) = C(I) + D(I)
       ENDDO
       DO I = 1, N
          B(I) = C(I) - D(I)
       ENDDO
  • If we fuse these loops, we can reuse the values of C(I) and D(I) while they are still in registers:
       DO I = 1, N
          A(I) = C(I) + D(I)
          B(I) = C(I) - D(I)
       ENDDO

  50. Example
       DO J = 1, N
          DO I = 1, M
             A(I,J) = C(I,J) + D(I,J)
          ENDDO
          DO I = 1, M
             B(I,J) = A(I,J-1) - E(I,J)
          ENDDO
       ENDDO
  • Fusion and interchange:
       DO I = 1, M
          DO J = 1, N
             B(I,J) = A(I,J-1) - E(I,J)
             A(I,J) = C(I,J) + D(I,J)
          ENDDO
       ENDDO
  • Scalar replacement:
       DO I = 1, M
          r = A(I,0)
          DO J = 1, N
             B(I,J) = r - E(I,J)
             r = C(I,J) + D(I,J)
             A(I,J) = r
          ENDDO
       ENDDO
  • We've saved (n-1)m loads of A
