
5th International Conference, HiPEAC 2010

Memory-aware application mapping on coarse-grained reconfigurable arrays. Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek. Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea.



Presentation Transcript


  1. Memory-aware application mapping on coarse-grained reconfigurable arrays
  Yongjoo Kim, Jongeun Lee*, Aviral Shrivastava**, Jonghee Yoon and Yunheung Paek
  Software Optimization And Restructuring, Department of Electrical Engineering, Seoul National University, Seoul, South Korea
  * Embedded Systems Research Lab, ECE, Ulsan Nat'l Institute of Science & Tech, Ulsan, Korea
  ** Compiler and Microarchitecture Lab, Center for Embedded Systems, Arizona State University, Tempe, AZ, USA
  2010-01-25, 5th International Conference, HiPEAC 2010

  2. Coarse-Grained Reconfigurable Array (CGRA)
  • High computation throughput
  • Low power consumption and good scalability
  • High flexibility with fast reconfiguration
  * CGRAs achieve roughly 10–100 MIPS/mW
  SO&R and CML Research Group

  3. Coarse-Grained Reconfigurable Array (CGRA)
  • Array of PEs (processing elements)
  • Mesh-like interconnect network
  • Each PE can operate on the results of its neighbor PEs
  • Executes computation-intensive kernels

  4. Application mapping in CGRA
  • Mapping a DFG (dataflow graph) onto the PE array mapping space
  • Must satisfy several conditions:
  - Each node must be mapped to a PE with the right functionality
  - Data transfer between nodes must be guaranteed
  - Resource consumption should be minimized for performance

  5. CGRA execution & data mapping
  tc: computation time, td: data transfer time
  [Figure: configuration memory and PE array with a local memory of four banks (Bk1–Bk4), each double-buffered (buf1/buf2) and connected to main memory through DMA]
  • Double buffering overlaps DMA transfers with computation
  • Total runtime = max(tc, td)
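The double-buffering runtime model on this slide can be sketched as follows (a minimal sketch; the function name and example numbers are illustrative, not taken from the paper):

```python
def tile_runtime(tc, td):
    """Per-tile runtime under double buffering: the DMA transfer of the
    next tile overlaps with computation on the current tile, so the
    runtime is bounded by the slower of the two."""
    return max(tc, td)

# Example: computation takes 20 time units, data transfer takes 70;
# the transfer dominates, so the tile is memory-bound.
print(tile_runtime(20, 70))  # 70
```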

  6. The performance bottleneck: data transfer
  [Figure: the ratio between tc and td per kernel, normalized so that 100% = tc + td]
  • Many multimedia kernels show a larger td than tc
  • On average, tc accounts for just 22% of total time
  • Most applications are memory-bound

  7. Computation mapping & data mapping
  • Duplicating an array across banks increases data transfer time
  [Figure: loads LD S[i] and LD S[i+1] feeding an add node, with S[i] and S[i+1] placed in the local memory banks]

  8. Contributions of this work
  • First approach to consider computation mapping and data mapping together
  - balance tc and td
  - minimize duplicate arrays (maximize data reuse)
  - balance bank utilization
  • Simple yet effective extension
  - a set of cost functions
  - can be plugged into existing compilation frameworks, e.g., EMS (edge-centric modulo scheduling)

  9. Application mapping flow
  DFG → Performance Bottleneck Analysis (→ DCR) / Data Reuse Analysis (→ DRG) → Memory-aware Modulo Scheduling → Mapping

  10. Preprocessing 1: performance bottleneck analysis
  • Determines whether it is computation or data transfer that limits the overall performance
  • Calculates the DCR (data-transfer-to-computation time ratio): DCR = td / tc
  • DCR > 1: the loop is memory-bound
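The DCR check above can be expressed as a short sketch (the function names are illustrative assumptions, not from the paper's toolchain):

```python
def dcr(td, tc):
    """Data-transfer-to-computation time ratio: DCR = td / tc."""
    return td / tc

def is_memory_bound(td, tc):
    """DCR > 1 means data transfer dominates: the loop is memory-bound."""
    return dcr(td, tc) > 1.0

# With the average ratio from the previous slide (tc is 22% of tc + td),
# td/tc = 78/22 > 1, so a typical kernel is memory-bound.
print(is_memory_bound(78, 22))  # True
```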

  11. Preprocessing 2: data reuse analysis
  • Finds the amount of potential data reuse
  • Creates a DRG (data reuse graph): nodes correspond to memory operations, and edge weights approximate the amount of reuse
  • An edge weight is estimated as TS - rd
  - TS: the tile size
  - rd: the reuse distance in iterations
  [Figure: example DRG over memory operations S[i], S[i+1], S[i+5], D[i], D[i+10], R[i], R2[i]]
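The TS - rd estimate can be sketched as below; clamping the weight at zero (no reuse once the distance exceeds the tile size) is my assumption, not stated on the slide:

```python
def drg_edge_weight(tile_size, reuse_distance):
    """Approximate the reuse between two memory operations as TS - rd:
    of the TS iterations in a tile, roughly rd accesses of the later
    operation have no earlier access to reuse. Clamped at zero here
    (assumption) so distant reuse contributes no weight."""
    return max(tile_size - reuse_distance, 0)

# S[i] and S[i+1] (reuse distance 1) in a tile of 30 iterations share
# 29 elements, so keeping them in the same bank saves 29 transfers.
print(drg_edge_weight(30, 1))  # 29
```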

  12. Application mapping flow
  • DCR & DRG are used for cost calculation
  DFG → Performance Bottleneck Analysis (→ DCR) / Data Reuse Analysis (→ DRG) → Memory-aware Modulo Scheduling → Mapping

  13. Mapping with data reuse opportunity cost
  [Figure: mapping a DFG onto the PE array with two local memory banks; placing A[i] and A[i+1] in the same bank avoids duplicating the array, while splitting them forfeits the reuse]
  • DROC (data reuse opportunity cost): the cost of giving up a data reuse opportunity
  • New total cost = memory-unaware cost + DROC
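The cost combination on this slide can be sketched as follows (hypothetical names; the example numbers are illustrative, not from the slide's figure):

```python
def placement_cost(memory_unaware_cost, droc):
    """New total cost = memory-unaware cost (e.g., the base routing and
    scheduling cost of an EMS-style mapper) + DROC, the data reuse
    opportunity cost paid when a placement gives up a reuse edge."""
    return memory_unaware_cost + droc

# Two candidate placements: a cheaper base cost that forfeits reuse can
# lose to a pricier base cost that keeps both accesses in one bank.
print(placement_cost(40, 20))  # 60: forfeits the reuse edge
print(placement_cost(50, 0))   # 50: keeps the reuse, so it wins
```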

  14. BBC (bank balancing cost)
  • Prevents allocating all data to just one bank
  • BBC(b) = β × A(b)
  - β: the base balancing cost (a design parameter; 10 in this example)
  - A(b): the number of arrays already mapped onto bank b
  [Figure: two candidate banks for a new array B[i]; Bank 1 already holds A[i] and A[i+1]]
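The bank balancing cost is a one-line formula; a sketch using the slide's β = 10 (the function name is an illustrative assumption):

```python
def bank_balancing_cost(beta, arrays_on_bank):
    """BBC(b) = beta * A(b), where beta is the base balancing cost (a
    design parameter) and A(b) is the number of arrays already mapped
    onto bank b. Fuller banks become progressively less attractive."""
    return beta * arrays_on_bank

# With beta = 10: an empty bank adds no penalty, while a bank already
# holding two arrays adds 20 to a candidate placement's cost.
print(bank_balancing_cost(10, 0))  # 0
print(bank_balancing_cost(10, 2))  # 20
```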

  15. Application mapping flow
  DFG → Performance Bottleneck Analysis (→ DCR) / Data Reuse Analysis (→ DRG) → Partial Shutdown Exploration → Memory-aware Modulo Scheduling → Mapping

  16. Partial shutdown exploration
  • For a memory-bound loop, performance is often limited by the memory bandwidth rather than by computation, so computation resources are in surplus
  • Explores partial shutdown of PE rows and memory banks
  • Finds the configuration that gives the minimum EDP (energy-delay product)
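Partial shutdown exploration is an exhaustive search over how many PE rows and memory banks stay powered. Here is a hedged sketch: the delay and power expressions are toy models of my own (compute throughput scaling with rows, bandwidth with banks, static power with active units), not the paper's models.

```python
def explore(rows_max=4, banks_max=4):
    """Enumerate active (rows, banks) configurations and return the one
    with the minimum energy-delay product (EDP = power * delay^2)."""
    best = None
    for rows in range(1, rows_max + 1):
        for banks in range(1, banks_max + 1):
            # Toy models (assumptions): delay is the max of a compute
            # bound and a memory-bandwidth bound; power is static power
            # of the active rows and banks.
            delay = max(100.0 / rows, 160.0 / banks)
            power = 10.0 * rows + 8.0 * banks
            edp = power * delay * delay
            if best is None or edp < best[0]:
                best = (edp, rows, banks)
    return best

print(explore())
```

In this toy setup the winner shuts down one PE row: with all four banks on, the bandwidth bound already dominates the delay, so a fourth compute row only adds power without reducing runtime — the same memory-bound effect the slide describes.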

  17. Example of partial shutdown exploration
  [Figure: the same kernel (loads of S[i], S[i+1], D[i]; store of R[i]) scheduled on a 4-row, 2-bank configuration and on a 2-row, 2-bank configuration]

  18. Experimental setup
  • A set of loop kernels from MiBench, multimedia, and SPEC 2000 benchmarks
  • Target architecture:
  - 4x4 heterogeneous CGRA (4 memory-accessible PEs)
  - 4 memory banks, each connected to one row
  - Each PE connected to its four neighbors and four diagonal ones
  • Compared mapping flows:
  - Ideal: memory-unaware mapping + single-bank memory architecture
  - MU: memory-unaware mapping (EMS*) + multi-bank memory architecture
  - MA: memory-aware mapping + multi-bank memory architecture
  - MA + PSE: MA + partial shutdown exploration
  * Edge-centric Modulo Scheduling for Coarse-Grained Reconfigurable Architectures, Hyunchul Park et al., PACT '08

  19. Runtime comparison
  • Compared with MU, MA reduces the runtime by 30%

  20. Energy consumption comparison
  • MA + PSE shows a 47% reduction in energy consumption

  21. Conclusion
  • CGRAs provide very high power efficiency while being software-programmable
  • While previous solutions have focused on computation speed, we also consider data transfer to achieve higher performance
  • We proposed an effective mapping heuristic that takes the memory architecture into account
  • It achieves a 62% reduction in the energy-delay product, which factors into 47% and 28% reductions in energy consumption and runtime, respectively

  22. Thank you for your attention!
