Memory-aware application mapping on coarse-grained reconfigurable arrays
Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea
* Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
2010-01-25, 5th International Conference, HiPEAC 2010
Coarse-Grained Reconfigurable Array (CGRA)
• High computation throughput
• Low power consumption and scalability
• High flexibility with fast configuration
* CGRAs deliver roughly 10~100 MIPS/mW
Coarse-Grained Reconfigurable Array (CGRA)
• Array of PEs
• Mesh-like interconnect network
• PEs operate on the results of their neighboring PEs
• Executes computation-intensive kernels
Application mapping in CGRA
• Mapping a DFG onto the PE-array mapping space
• Several conditions must be satisfied:
• Nodes must be mapped onto PEs that provide the required functionality
• Data transfer between dependent nodes must be guaranteed (routable)
• Resource consumption should be minimized for performance
CGRA execution & data mapping
• tc: computation time, td: data transfer time
• Data moves between main memory and the local memory banks (Bk1-Bk4) via DMA
• Double buffering: each bank holds two buffers so DMA transfers overlap with computation
• Total runtime = max(tc, td)
[Figure: configuration memory, PE array, local memory banks with double buffers, DMA to main memory]
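To make the double-buffering runtime relation concrete, here is a minimal Python sketch of the max(tc, td) model from this slide; the per-tile times, the tile count, and the non-overlapped first transfer are illustrative assumptions, not values from the slides.

```python
# Sketch of the double-buffering runtime model on this slide.
# tc, td, and num_tiles below are hypothetical illustration values.

def tile_runtime(tc: float, td: float) -> float:
    """With double buffering, the DMA transfer for the next tile overlaps
    with computation on the current tile, so each tile effectively costs
    the slower of the two: max(tc, td)."""
    return max(tc, td)

def loop_runtime(tc: float, td: float, num_tiles: int) -> float:
    # Assumption: the very first transfer cannot be hidden; the remaining
    # tiles are fully pipelined.
    return td + num_tiles * tile_runtime(tc, td)

if __name__ == "__main__":
    print(loop_runtime(tc=100.0, td=250.0, num_tiles=8))  # memory-bound case: 2250.0
```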
The performance bottleneck: data transfer
• Normalizing tc + td to 100%, many multimedia kernels show a larger td than tc
• On average, tc accounts for just 22% of the total
[Figure: the ratio between tc and td across kernels]
• Most of these applications are memory-bound
Computation Mapping & Data Mapping
• Where memory operations are placed on the PE array determines which local memory bank each array must reside in
• Duplicating an array across banks increases the data transfer time
[Figure: LD S[i] and LD S[i+1] mapped to PEs attached to different banks, so array S must be duplicated in both banks]
Contributions of this work
• First approach to consider computation mapping and data mapping together
- balance tc and td
- minimize duplicate arrays (maximize data reuse)
- balance bank utilization
• Simple yet effective extension
- a set of cost functions
- can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)
Application mapping flow
• DFG → Performance Bottleneck Analysis (produces DCR) and Data Reuse Analysis (produces DRG) → Memory-aware Modulo Scheduling → Mapping
Preprocessing 1: Performance bottleneck analysis
• Determines whether it is computation or data transfer that limits the overall performance
• Calculates the DCR (data-transfer-to-computation time ratio): DCR = td / tc
• DCR > 1: the loop is memory-bound
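A minimal sketch of the DCR check, assuming tc and td have already been estimated for the loop (the numbers below are illustrative):

```python
# DCR = td / tc; DCR > 1 means the loop is memory-bound.

def is_memory_bound(tc: float, td: float) -> bool:
    dcr = td / tc
    return dcr > 1.0

print(is_memory_bound(tc=220.0, td=780.0))  # True: td dominates, as in typical multimedia kernels
```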
Preprocessing 2: Data reuse analysis
• Finds the amount of potential data reuse
• Creates a DRG (Data Reuse Graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
• The edge weight is estimated as TS - rd, where TS is the tile size and rd is the reuse distance in iterations
[Figure: example DRG over accesses such as S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]
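As an illustration of the TS - rd edge-weight estimate, here is a small sketch; the tile size and reuse distances are assumed values, and clamping at zero for distances larger than the tile size is an assumption.

```python
# Edge weight between two memory operations in the DRG: approximately TS - rd.

def drg_edge_weight(tile_size: int, reuse_distance: int) -> int:
    # Assumption: no reuse credit once the reuse distance exceeds the tile size.
    return max(tile_size - reuse_distance, 0)

# S[i] and S[i+1] reuse data one iteration apart (rd = 1):
print(drg_edge_weight(tile_size=64, reuse_distance=1))   # 63 -> strong reuse
# D[i] and D[i+10] are ten iterations apart (rd = 10):
print(drg_edge_weight(tile_size=64, reuse_distance=10))  # 54 -> weaker reuse
```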
Application mapping flow
• The DCR and DRG are used for cost calculation during Memory-aware Modulo Scheduling
• Same flow as before: DFG → Performance Bottleneck Analysis / Data Reuse Analysis → Memory-aware Modulo Scheduling → Mapping
Mapping with data reuse opportunity cost
• When a memory operation is placed, each candidate placement is charged a data reuse opportunity cost (DROC) for the reuse it would forgo
• New total cost = memory-unaware cost + DROC
• Placements that keep reusing accesses (e.g., A[i] and A[i+1]) in the same bank become cheaper than those that duplicate the array
[Figure: PE array and two local memory banks, showing the memory-unaware cost, the DROC, and the new total cost for each candidate placement]
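A hedged sketch of how the DROC could be folded into an existing placement cost, following the "new total cost" formula on this slide; the numeric costs are loosely taken from the figure and are illustrative only.

```python
# New total cost = memory-unaware cost + data reuse opportunity cost (DROC).

def total_cost(memory_unaware_cost: float, droc: float) -> float:
    return memory_unaware_cost + droc

# Candidate keeping A[i] and A[i+1] in the same bank: reuse preserved, no DROC.
print(total_cost(40, 0))    # 40
# Candidate splitting them across banks: reuse lost, DROC penalty applies.
print(total_cost(40, 20))   # 60 -> less attractive to the scheduler
```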
BBC (Bank Balancing Cost)
• Prevents allocating all data to just one bank
• BBC(b) = β × A(b), where β is the base balancing cost (a design parameter) and A(b) is the number of arrays already mapped onto bank b
[Figure: two candidate banks with β = 10; the bank already holding A[i] and A[i+1] gets a higher BBC than the empty one]
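A minimal sketch of the bank balancing cost; β = 10 follows the figure, and the per-bank array counts are assumed for illustration.

```python
# BBC(b) = beta * A(b): penalize banks in proportion to the arrays they already hold.

def bank_balancing_cost(arrays_on_bank: int, beta: float = 10.0) -> float:
    return beta * arrays_on_bank

banks = {"Bank1": 2, "Bank2": 0}  # arrays already mapped per bank (assumed)
print({b: bank_balancing_cost(n) for b, n in banks.items()})
# {'Bank1': 20.0, 'Bank2': 0.0} -> the next array is steered toward Bank2
```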
Application mapping flow
• The flow is extended with Partial Shutdown Exploration: DFG → Performance Bottleneck Analysis (DCR) / Data Reuse Analysis (DRG) → Partial Shutdown Exploration → Memory-aware Modulo Scheduling → Mapping
Partial Shutdown Exploration
• For a memory-bound loop, performance is often limited by memory bandwidth rather than by computation, so computation resources are in surplus
• Explores partial shutdown of PE rows and memory banks
• Finds the configuration that gives the minimum EDP (Energy-Delay Product)
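A rough sketch of the exploration loop under stated assumptions: the runtime and power models below are toy placeholders, whereas the real flow would rerun the memory-aware modulo scheduler and an energy model for each row/bank configuration.

```python
# Enumerate how many PE rows and memory banks stay powered on and keep the
# configuration with the lowest energy-delay product (EDP = energy * delay).

def explore_partial_shutdown(max_rows, max_banks, runtime_of, power_of):
    best = None
    for rows in range(1, max_rows + 1):
        for banks in range(1, max_banks + 1):
            t = runtime_of(rows, banks)            # schedule length (placeholder model)
            edp = power_of(rows, banks) * t * t    # energy (= power * t) times delay
            if best is None or edp < best[0]:
                best = (edp, rows, banks)
    return best

# Toy models: a memory-bound loop, so shutting down PE rows barely hurts runtime.
runtime = lambda rows, banks: max(800 / banks, 400 / rows)
power = lambda rows, banks: 10 * rows + 5 * banks
print(explore_partial_shutdown(4, 4, runtime, power))  # -> (1600000.0, 2, 4): two rows suffice
```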
Example of partial shutdown exploration
[Figure: the same kernel (loads of S[i], S[i+1], D[i] and a store of R[i]) mapped on a 4-row/2-bank configuration and on a 2-row/2-bank configuration]
Experimental Setup
• A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
• Target architecture
- 4x4 heterogeneous CGRA (4 memory-accessible PEs)
- 4 memory banks, each connected to one row
- Each PE is connected to its four neighbors and four diagonal ones
• Compared mapping flows
- Ideal: memory-unaware mapping + single-bank memory architecture
- MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
- MA: memory-aware mapping + multi-bank memory architecture
- MA + PSE: MA + partial shutdown exploration
* Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT 2008
Runtime comparison
• Compared with MU, MA reduces the runtime by 30%
Energy consumption comparison
• MA + PSE shows a 47% reduction in energy consumption
Conclusion
• CGRAs provide very high power efficiency while being software programmable
• While previous solutions have focused on computation speed, we also consider data transfer to achieve higher performance
• We proposed an effective heuristic that considers the memory architecture
• It achieves a 62% reduction in the energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively
Thank you for your attention!