1 / 28

Synthesis on Distributed-memory System

Synthesis on Distributed-memory System. Test Bed 8-node DEC Alpha Farm with DEC HPF compiler IBM SP2 with HPF compiler nCUBE/2 with 16 nodes. CPU. CPU. CPU. CPU. CPU. Memory. Memory. Memory. Memory. Memory. Interconnection Network. HPF Example. REAL A(N,2*N), B(2*N), C(2*N,2*N)

tirza
Download Presentation

Synthesis on Distributed-memory System

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Synthesis on Distributed-memory System • Test Bed • 8-node DEC Alpha Farm with DEC HPF compiler • IBM SP2 with HPF compiler • nCUBE/2 with 16 nodes CPU CPU CPU CPU CPU Memory Memory Memory Memory Memory Interconnection Network

  2. HPF Example REAL A(N,2*N), B(2*N), C(2*N,2*N) REAL D(2*N,N), E(N), F(N), G(N) !HPF TEMPLATE TEMP(N*N*4,N*N*4) !HPF ALIGN A(i,j) WITH TEMP(4*i-3,4*j-3) !HPF ALIGN B(i) WITH TEMP(*,4*i-3) !HPF ALIGN C(i,j) WITH TEMP(4*j-3,i) !HPF ALIGN D(i,j) WITH TEMP(4*j-3,4*i-3) !HPF ALIGN G(i) WITH TEMP(4*i-3,*) D=C(:,1:4*N:4) E=SUM(A+TRANSPOSE(D),DIM=2) F=SUM(SPREAD(B,DIM=2,NCOPIES=N)+D,DIM=1) G=B(1:2*N:2)+E+F S. Chatterjee et al. “Automatic Array Alignment in Data-Parallel Programs” ACM Symposium on Principles of Programming Languages,1993.

  3. HPF Example (Cont’d) • We execute the codes by DEC HPF Compiler on 8-node DEC Farm with an FDDI network. • Loop 100 times. N is set to 128.

  4. Apply Array Operation Synthesis to Distributed-memory Machines • Owner Computes Rule of HPF • Memory References of Distributed-memory Machines • Local References • Remote References(Communication)

  5. Synthesis Anomaly Array Operation Synthesis may either • Example of Synthesis Anomaly decrease Remote References or GOOD increase Remote References or Require more communication time Synthesis Anomaly !HPF TEMPLATE TEMP(N,N) REAL A(N,N),B(N,N),C(N,N) !HPF ALIGN A(I,J),B(I,J),C(I,J)WITH TEMP(I,J) C=TRANSPOSE(A+B,1)

  6. Evaluating Array Expression Optimal Solution is NP-hard • Except the optimal solution, we also propose a heuristic algorithm. Under the Owner Computes Rule To Synthesize Part of Array Operations To Find Data Layout of Temporary Arrays

  7. A Heuristic to Reduce Synthesis Anomaly Do Array Operation Synthesis Does synthesis increase communication cost? Roll back temporary arrays Do code generation normally

  8. Reference Location !HPF$ ALIGN F(I,J) WITH TEMP(3*I-1,J) • The reference location of F(2*J-1,I) with respect to TEMP is: TEMP(3*(2*J-1)-1,I)=TEMP(6*J-4,I)

  9. Demonstration Example of Heuristic Algorithm !HPF$ TEMPLATE TEMP(300,300) REAL A(100,100),B(100,100),C(100,100) REAL D(100,100),E(100,100),F(200,100), G(300,100) !HPF$ ALIGN A(i,j),B(i,j),C(i,j),D(i,j),E(i,j) with TEMP(i,j) !HPF$ ALIGN F(i,j) with TEMP(3*i-1,j) !HPF$ ALIGN G(i,j) with TEMP(2*i,j) C(1:100,:)=F(1:200:2,:) D(1:100,:)=G(1:300:3,:) E=CSHIFT(TRANSPOSE(A+B),1,1 )*(TRANSPOSE(C)-TRANSPOSE(D) ) Do Array Operation Synthesis FORALL (I=1:99,J=1:100) E(I,J)=(A[J,I+1]+B[J,I+1])*(F[2*J-1,I]-G[3*J-2,I]) END FORALL FORALL (I=100:100,J=1:100) E(I,J)=(A[J,I-99]+B[J,I-99])*(F[2*J-1,I]-G[3*J-2,I]) END FORALL

  10. Heuristic Algorithm (1) E(I,J)=(A[J,I+1]+B[J,I+1])*(F[2*J-1,I]-G[3*J-2,I]) E[I,J] F[2*J-1,I] A[J,I+1] B[J,I+1] G[3*J-2,I]

  11. TEMP[I,J] TEMP[6*J-4,I] TEMP[J,I+1] TEMP[6*J-4,I] TEMP[J,I+1]

  12. SB1 SB2

  13. Heuristic Algorithm (1) E(I,J)=(A[J,I+1]+B[J,I+1])*(F[2*J-1,I]-G[3*J-2,I]) E[I,J] TEMP[I,J] SB1 SB2 F[2*J-1,I] A[J,I+1] B[J,I+1] G[3*J-2,I] TEMP[6*J-4,I] TEMP[J,I+1] TEMP[6*J-4,I] TEMP[J,I+1]

  14. Heuristic Algorithm (2) !HPF$ ALIGN TA1(I,J) WITH TEMP(J,I+1) !HPF$ ALIGN TA2(I,J) WITH TEMP(6*J-4,I) • Create Temporary Arrays FORALL (I=1:99,J=1:100) TA1(I,J) =A(J,I+1)+B(J,I+1) END FORALL Communication Free Loop For Subtree SB1 FORALL (I=1:99,J=1:100) TA2(I,J) =F(2*J-1,I)-G(3*J-2,I) END FORALL For Subtree SB2 Communication Free Loop FORALL (I=1:99,J=1:100) E(I,J)=TA1(I,J)*TA2(I,J) END FORALL

  15. Experimental Results on DEC Workstation Farm (N=128) (Purdue-set Problem 9) (APULE routine electromagnetic scattering problem)

  16. Experimental Results on DEC Workstation Farm (N=128) (Sandia Wave)

  17. Experimental Results on IBM SP2 (N=512) (Purdue-set Problem 9) (APULE routine electromagnetic scattering problem)

  18. Experimental Results on IBM SP2 (N=512) (Sandia Wave)

  19. Experimental Results on nCUBE/2 with 16 processors

  20. Experimental Results on nCUBE/2 with 16 processors

  21. Performance of eight suites on nCUBE/2

  22. Performance of eight suites on nCUBE/2

  23. Array Operation Synthesis in Distributed-memory Machines • Optimal solution is NP-hard • A heuristic algorithm for code generation • Experimental results show speedups from 1.6 to 10.4 for HPF code fragments on DEC alpha farm and IBM SP2 • We demonstrated that it is also profitable in applying AOS to HPF programs

  24. Integrating AOS and Automatic Data Alignment • Data Alignment Directive in HPF is with continuous semantics • The data access patterns after applying AOS may not be continuous • We propose Segmented Alignment !HPF$ ALIGN TA2(I,J) WITH TEMP(6*J-4,I)

  25. Segmented Alignment • To align Arrays in a specified index domain • Implementation of Segmented Alignment • Split an array into several subarrays !HPF$ ALIGN TA2(I,J) WITH TEMP(6*J-4,I) WHEN (I,J) IN (1:N/2:1, N/2:N:1)

  26. Conclusion • The Array Operation Synthesis can handle RESHAPE, SPREAD, CSHIFT, EOSHIFT, TRANSPOSE, MERGE, section movement, reduction operations, and WHERE construct • The measured speedups from real applications between 1.21 and 7.55 in Sequent S27 and SGI Power Challenge. • Experimental results show speedups from 1.6 to 10.4 for HPF code fragments from real applications on DEC alpha Farm and IBM SP2

  27. Future Work • To handle PACK, UNPACK and Matrix Multiplication • Integrating Automatic Data Alignment and AOS • Synthesis for array operation functions which includes message passing codes • Applying AOS toward a more extensive set of data parallel programs

  28. Future Work • To handle PACK, UNPACK and Matrix Multiplication • Synthesis for array operation functions which includes message passing codes • Applying AOS toward a more extensive set of data parallel programs

More Related