Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology
F. Hannig, H. Dutta, W. Tichy, and Jürgen Teich
University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
Presented by: Luis Ortiz
Department of Computer Science
The University of Texas at San Antonio
Outline
• Overview
• The Problem
• Reconfigurable Architectures
• Design Flow for Regular Mapping
• Parallelizing Transformations
• Constraints Related to CG Reconfigurable Arrays
• Case Study
• Results
• Conclusions and Future Work
Overview
• Constructing a parallel program is equivalent to specifying its execution order
  • the operations of a program form a set, and the execution order is a binary, transitive, and asymmetric relation on it
  • the relevant sets are (unions of) Z-polytopes
  • most optimizations can be presented as transformations of the original program
• The problem of automatic parallelization
  • given a set of operations E and a strict total order on it
  • find a partial order on E such that execution of E under it is determinate and gives the same results as the original program
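To make the order-relation view concrete, here is a minimal C sketch (our illustration, not from the paper or the slides): the statement instances of a loop form the set E, the sequential program imposes a total order on them, and the data dependences determine the weakest partial order a parallelizer may keep.

    #include <stdio.h>

    #define N 8

    int main(void) {
        int a[N], b[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* Statement instances S(0), ..., S(N-1) form the set E.
         * The sequential program imposes the total order S(i) < S(j) iff i < j. */
        a[0] = b[0];
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];   /* S(i) reads what S(i-1) wrote */

        /* The dependence a[i-1] -> a[i] forces a chain, so here the required
         * partial order coincides with the total order and no instance-level
         * parallelism exists.  If the statement were a[i] = b[i] + 1 instead,
         * the required partial order would be empty and all N instances could
         * execute in parallel with the same result. */
        for (int i = 0; i < N; i++)
            printf("%d ", a[i]);
        return 0;
    }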
Overview (cont.)
• Defining a polyhedron
  • a set of linear inequalities: Ax + a ≥ 0
  • the polyhedron is the set of all x which satisfy these inequalities
  • the basic property of a polyhedron is convexity:
    • if two points p and q belong to a polyhedron, then so do all their convex combinations λp + (1 – λ)q, 0 ≤ λ ≤ 1
  • a bounded polyhedron is called a polytope
Overview (cont.)
• The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
  • the iteration domain of statement S: Dom(S) = {x | D_S x + d_S ≥ 0}
  • D_S and d_S are the matrix and constant vector which define the iteration polytope; d_S may depend linearly on the structure parameters
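As a worked example (ours, not from the paper): for the triangular loop nest `for (i = 0; i < N; i++) for (j = 0; j <= i; j++) S;` the constraints i ≥ 0, i ≤ N−1, j ≥ 0, j ≤ i give

\[
\mathrm{Dom}(S) = \left\{ \begin{pmatrix} i \\ j \end{pmatrix} \in \mathbb{Z}^2 \;\middle|\; D_S \begin{pmatrix} i \\ j \end{pmatrix} + d_S \ge 0 \right\},
\qquad
D_S = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 1 & -1 \end{pmatrix},
\quad
d_S = \begin{pmatrix} 0 \\ N-1 \\ 0 \\ 0 \end{pmatrix}.
\]

Note how d_S depends linearly on the structure parameter N, as stated above.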
Overview (cont.)
• Coarse-grained reconfigurable architectures
  • provide the flexibility of software combined with the performance of hardware
  • but exploiting the hardware is a problem due to the lack of mapping tools
• Parallelization techniques and compilers
  • map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays
The Problem “Mapping a certain class of regular nested loop programs onto a dedicated processor array”
Reconfigurable Architectures
• Span a wide range of abstraction levels
  • from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components
• Efficiency comparison
  • standard arithmetic is less efficient on fine-grained architectures, due to the large routing area overhead
• Little research work deals with compilation to coarse-grained reconfigurable architectures
Design Flow for Regular Mapping (cont.)
• A piecewise regular algorithm contains N quantified equations
  • each equation S_i[I] is of the form x_i[I] = f_i(…, x_j[I − d_ji], …) for all I ∈ I_i
  • x_i[I] are indexed variables
  • f_i are arbitrary functions
  • d_ji ∈ ℤ^n are constant data dependence vectors; the ellipses '…' denote similar arguments
  • the I_i are called index spaces
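For instance, the matrix multiplication of the later case study fits this form; the following is a sketch in the usual single-assignment style, not copied from the paper:

\[
S_1[I]:\; c[i,j,k] = c[i,j,k-1] + a[i,k]\cdot b[k,j],
\qquad
I = (i,j,k)^T \in \mathcal{I}_1 = \{\, I \in \mathbb{Z}^3 \mid 1 \le i,j,k \le N \,\},
\]

with the constant data dependence vector d = (0, 0, 1)^T on the variable c.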
Design Flow for Regular Mapping (cont.)
• Linearly bounded lattice
  • an index space of the form I = {I ∈ ℤ^n | I = Mκ + c ∧ Aκ ≥ b}: the integer points κ of a polyhedron are affinely mapped onto the iteration vectors I
• Block pipelining period (β)
  • the time interval between the initiations of two successive problem instances
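A small example under this definition (our illustration, not from the paper): choosing M = (2), c = (1), and the interval 0 ≤ κ ≤ N − 1 yields

\[
\mathcal{I} = \{\, I \in \mathbb{Z} \mid I = 2\kappa + 1 \;\wedge\; 0 \le \kappa \le N-1 \,\},
\]

i.e., the first N odd integers; with M the identity and c = 0, a linearly bounded lattice degenerates to the integer points of the polyhedron itself.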
Parallelizing Transformations
• Based on this representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:
• Affine Transformations
• Localization (sketched in C after this list)
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
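To illustrate the localization step, here is a minimal C sketch (ours, not the paper's code): a value read by many iterations is turned into a locally propagated copy so that every remaining dependence is a short, constant vector.

    #include <stdio.h>
    #define N 4

    int main(void) {
        static int a[N], b[N], c[N][N], bl[N][N];
        for (int i = 0; i < N; i++) { a[i] = i + 1; b[i] = i + 1; }

        /* Before localization, c[i][j] += a[i] * b[j] reads b[j] in every
         * iteration i: a global "broadcast" dependence that maps badly onto
         * a processor array.  After localization, b is propagated from
         * neighbor to neighbor along i, so the only dependence on the
         * propagated copy bl is the constant vector (1, 0). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                bl[i][j] = (i == 0) ? b[j] : bl[i - 1][j]; /* local propagation */
                c[i][j] += a[i] * bl[i][j];                /* purely local read */
            }

        printf("c[%d][%d] = %d\n", N - 1, N - 1, c[N - 1][N - 1]);
        return 0;
    }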
Constraints Related to CG Reconfigurable Arrays
• Coarse-grained (re)configurable architectures consist of an array of processor elements (PEs), where each PE contains:
  • one or more dedicated functional units, or one or more arithmetic logic units (ALUs)
  • memory
    • local memory → register files
    • memory banks
    • an instruction memory is required if the PE contains an instruction-programmable ALU
  • interconnect structures
  • I/O ports
  • synchronization and reconfiguration mechanisms
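These architecture parameters can be gathered into a resource model; the following struct is a hypothetical sketch (the field names are ours, not from the paper) of the constraint data a mapping tool would consult per PE.

    /* Hypothetical per-PE resource model; field names are illustrative
     * and not taken from the paper. */
    typedef struct {
        int num_alus;            /* arithmetic logic units per PE                */
        int num_func_units;      /* dedicated functional units per PE            */
        int register_file_size;  /* local memory: registers per PE               */
        int memory_banks;        /* attached memory banks                        */
        int instr_mem_words;     /* > 0 only for instruction-programmable ALUs   */
        int io_ports;            /* I/O ports per PE                             */
    } pe_resources_t;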
Case Study
• Regular mapping methodology applied to a matrix multiplication algorithm
• Target architecture: PACT XPP64-A reconfigurable processor array
  • 64 ALU-PAEs with 24-bit data paths, arranged in an 8×8 array
  • each ALU-PAE consists of three objects:
    • the ALU object
    • a Back-Register object (BREG)
    • a Forward-Register object (FREG)
  • all objects are connected to horizontal routing channels
Case Study (cont.)
• RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
  • the RAM can be configured to FIFO mode
  • each RAM-PAE has a 512×24-bit storage capacity
• Four independent I/O interfaces are located at the corners of the array
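For reference, the stated XPP64-A figures can be summarized as constants (a sketch; only the numbers come from the slides, the identifier names are ours):

    /* PACT XPP64-A parameters as given in the slides. */
    enum {
        XPP_ROWS      = 8,    /* 8 x 8 ALU-PAE array           */
        XPP_COLS      = 8,
        XPP_ALU_PAES  = 64,
        XPP_DATA_BITS = 24,   /* 24-bit data paths             */
        XPP_RAM_WORDS = 512,  /* each RAM-PAE: 512 x 24 bit    */
        XPP_IO_IFACES = 4     /* I/O interfaces at the corners */
    };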
Case Study (cont.)
[Figure: Structure of the PACT XPP64-A reconfigurable processor]
[Figure: ALU-PAE objects]
Case Study (cont.)
• Matrix multiplication algorithm: C = A · B, with A, B ∈ ℤ^(N×N)
  • the computations may be represented by a dependence graph (DG)
  • dependence graphs can be represented in a reduced form
    • reduced dependence graph: with each edge e = (v_i, v_j) there is associated a dependence vector d_ij ∈ ℤ^n
  • virtual processor elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
Case Study (cont.)
[Figure: Matrix multiplication algorithm, C code]
[Figure: Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization]
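The code appears only as images in the slides. The untransformed algorithm is the standard triple loop, and the transformed version is sketched here in single-assignment style following the usual derivation (a reconstruction under those assumptions, so details may differ from the paper):

    #include <stdio.h>
    #define N 4

    int A[N][N], B[N][N], C[N][N];

    /* Original nested loop program: C = A * B. */
    void matmul(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    /* After embedding and localization (sketch): every variable is
     * propagated with a constant dependence vector. */
    int a[N][N][N], b[N][N][N], c[N][N][N];

    void matmul_localized(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    a[i][j][k] = (j == 0) ? A[i][k] : a[i][j - 1][k]; /* d = (0,1,0) */
                    b[i][j][k] = (i == 0) ? B[k][j] : b[i - 1][j][k]; /* d = (1,0,0) */
                    c[i][j][k] = ((k == 0) ? 0 : c[i][j][k - 1])
                               + a[i][j][k] * b[i][j][k];             /* d = (0,0,1) */
                }
        /* The result lives in C[i][j] = c[i][j][N-1]. */
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1; B[i][j] = 1; }
        matmul();
        matmul_localized();
        printf("C[0][0] = %d,  c[0][0][N-1] = %d\n", C[0][0], c[0][0][N - 1]);
        return 0;
    }

The constant dependence vectors in the comments correspond to the edges of the reduced dependence graph described on the previous slide.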
Case Study (cont.)
[Figure: DG of the transformed matrix multiplication algorithm for N = 2]
[Figure: 4 × 4 processor array]
[Figure: Reduced dependence graph]
Case Study (cont.)
• Output data
  • O_x denotes the output-variable space of variable x of the space-time mapped or partitioned index space
  • the output can be two-dimensional
  • the transformed output variables are distributed over the entire array
  • the data must be collected from one processor line PL and fed out to an array border
  • with m ∈ ℤ^(1×n), the t ∈ T_x(P_i,j) denote the time instances at which variable x produces an output at processor element P_i,j
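A hedged sketch of the collection idea (our illustration, not the paper's implementation): the output values held along one processor line are shifted toward the array border one hop per cycle, so a single border port can serialize everything the line produced.

    #include <stdio.h>
    #define COLS 4

    int main(void) {
        int line[COLS] = {10, 20, 30, 40};       /* outputs held by PEs P_{i,0..3} */
        for (int t = 0; t < COLS; t++) {         /* one cycle per output word      */
            printf("cycle %d: border port emits %d\n", t, line[COLS - 1]);
            for (int c = COLS - 1; c > 0; c--)   /* shift toward the border        */
                line[c] = line[c - 1];
        }
        return 0;
    }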
Case Study (cont.)
• Output data can be serialized if one of the following conditions holds:
[Figure: serialization conditions]
Case Study (cont.)
• Partitioned implementation of the matrix multiplication algorithm
[Figure: Dataflow graph of the LPGS-partitioned matrix multiplication, 4 × 4 example]
[Figure: Dataflow graph after performing localization inside each tile]
[Figure: Array implementation of the partitioned example]
Results
• Both implementations (full-size and partitioned) show optimal utilization of resources
• Each configured MAC unit performs one operation per cycle
• A better implementation can achieve more performance per cycle while using fewer resources
• The number of ALUs is reduced from 3N to N
• Merging and writing of output data streams is overlapped with computations in the PEs
Conclusions and Future Work
• The mapping methodology based on loop parallelization in the polytope model provides results that are efficient in terms of resource utilization and execution time
• Future work focuses on the automatic compilation of nested loop programs