
Data-centric Subgraph Mapping for Narrow Computation Accelerators


Presentation Transcript


  1. Data-centric Subgraph Mapping for Narrow Computation Accelerators. Amir Hormati, Nathan Clark, and Scott Mahlke. Advanced Computer Architecture Lab, University of Michigan

  2. Introduction • Migration of applications • Programmability and cost issues in ASICs • More functionality in the embedded processor

  3. What Are the Challenges? • Accelerator hardware • Compiler algorithm

  4. Configurable Compute Array (CCA) • Array of FUs: arithmetic/logic, 32-bit functional units • Full interconnect between rows • Supports 95% of all computation patterns (Nathan Clark, ISCA 2005) • [Figure: CCA datapath with inputs Input1-Input4 feeding the array and producing Output1 and Output2]
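
  The row structure with its full interconnect is easy to model. Below is a minimal Python sketch (the opcode table and function names are illustrative assumptions, not the paper's toolchain): each FU in a row reads any two outputs of the row above.

      OPS = {"ADD": lambda x, y: (x + y) & 0xFFFFFFFF,
             "AND": lambda x, y: x & y,
             "OR":  lambda x, y: x | y,
             "XOR": lambda x, y: x ^ y}

      def run_cca(inputs, rows):
          # Evaluate one CCA configuration. `rows` is a list of FU rows;
          # each FU is (op, src_a, src_b), where the sources index the
          # previous row's outputs (full interconnect: any FU may read
          # any value the row above produced).
          values = list(inputs)
          for row in rows:
              values = [OPS[op](values[a], values[b]) for op, a, b in row]
          return values

      # (in1 AND in2) XOR (in3 OR in4) on a 2-row configuration:
      print(run_cca([0xF0, 0x3C, 0x01, 0x02],
                    [[("AND", 0, 1), ("OR", 2, 3)],
                     [("XOR", 0, 1)]]))          # -> [0x33]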

  5. Report Card on the Original CCA • Easy to integrate into current embedded systems • High performance gain, however... • 32-bit general-purpose CCA (130nm standard cell library): area 0.3mm², latency 3.3ns • [Figure: die photo of a processor with a CCA]

  6. Objectives of this Work • Redesign of the CCA hardware: area, latency • Compilation strategy: code quality, runtime

  7. Width Utilization • The full width of the FUs is not always needed. • Replacing FUs with narrower FUs is not a solution by itself.

  8. Width-Aware Narrow CCA • [Figure: datapath in which the input registers feed the low-order bits [0-7] into the narrow CCA while a width checker inspects the high-order bits [8-31]; an iteration controller re-issues the CCA with the saved carry bits until the full result reaches output registers Output1 and Output2]
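
  A behavioral sketch of this slice-and-carry iteration, in Python (the function and its parameters are assumptions for illustration; the hardware applies the same loop to the whole mapped subgraph, not just a single ADD):

      def narrow_add(a, b, slice_bits=8, width=32):
          # Iterative datapath model: compute a width-bit ADD on a
          # slice_bits-wide FU, carrying between iterations. The width
          # checker stops early once no operand bits remain above the
          # current slice and no carry is pending.
          mask = (1 << slice_bits) - 1
          result, carry = 0, 0
          for i in range(width // slice_bits):
              s = ((a >> (i * slice_bits)) & mask) \
                + ((b >> (i * slice_bits)) & mask) + carry
              result |= (s & mask) << (i * slice_bits)
              carry = s >> slice_bits
              hi = (i + 1) * slice_bits
              if carry == 0 and (a >> hi) == 0 and (b >> hi) == 0:
                  break              # width checker: skip the rest
          return result & ((1 << width) - 1)

      assert narrow_add(0x1D, 0x0C) == 0x29     # 8-bit operands: 1 pass
      assert narrow_add(0x12345678, 0x01) == 0x12345679   # 4 passes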

  9. Sparse Interconnect • Rank wires based on utilization. • >50% of the wires removed. • 91% of all patterns are still supported. • [Figure: full vs. sparse interconnect between CCA rows, each with inputs Input1-Input4 and outputs Output1 and Output2]
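
  A minimal sketch of the wire-ranking idea (Python; the function name and the utilization counts are hypothetical):

      def sparsify_interconnect(wire_utilization, keep_fraction=0.5):
          # Rank wires by profiled utilization and keep the most-used
          # fraction; the slide's design point removes >50% of wires yet
          # still supports 91% of patterns.
          ranked = sorted(wire_utilization, key=wire_utilization.get,
                          reverse=True)
          keep = max(1, int(len(ranked) * keep_fraction))
          return set(ranked[:keep])

      # Keep the 2 most-used of 4 wires:
      print(sparsify_interconnect({"w0": 970, "w1": 40,
                                   "w2": 610, "w3": 5}))  # {'w0', 'w2'}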

  10. Synthesis Results • Synthesized using Synopsys and Encounter in a 130nm library.

  11. Compilation Challenges • Finding the best portions of the code to accelerate • Non-uniform latency • Current solutions: hand coding, function intrinsics, greedy selection

  12. Step 1: Enumeration • Enumerate all candidate subgraphs of the dataflow graph. • [Figure: dataflow graph of ADD/AND/OR/XOR/CMP operations with live-ins and live-outs; the enumerated candidate subgraphs are numbered 1-8]
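
  A compact sketch of exhaustive connected-subgraph enumeration (illustrative Python, not the paper's implementation; practical enumerators bound the search far more aggressively):

      def enumerate_subgraphs(nodes, edges, max_size=4):
          # Grow every connected subgraph up to max_size nodes from each
          # seed node. Illustrative: duplicates are generated and folded
          # away by the set; real enumerators prune much harder.
          adj = {n: set() for n in nodes}
          for u, v in edges:
              adj[u].add(v)
              adj[v].add(u)          # connectivity ignores direction
          found = set()

          def grow(sub, frontier):
              found.add(frozenset(sub))
              if len(sub) == max_size:
                  return
              for n in frontier:
                  new = sub | {n}
                  grow(new, (frontier | adj[n]) - new)

          for n in nodes:
              grow({n}, adj[n])
          return found

      g = enumerate_subgraphs([1, 2, 3], [(1, 2), (2, 3)])
      print(sorted(map(sorted, g)))  # [[1], [1,2], [1,2,3], [2], [2,3], [3]]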

  13. Step 2: Subgraph Isomorphism Pruning • Ensure enumerated subgraphs can run on the accelerator. • [Figure: candidate subgraphs built from SHL, AND, SUB, ADD, and SHRA operations are matched against the CCA's supported shapes; non-matching candidates are discarded]
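
  A simplified stand-in for this step (Python; the opcode set and the depth limit are assumptions, and the paper's actual test is isomorphism against the CCA's supported patterns):

      SUPPORTED_OPS = {"ADD", "SUB", "AND", "OR", "XOR",
                       "SHL", "SHRA", "CMP"}

      def prune_infeasible(subgraphs, op_of, depth_of, max_depth=3):
          # Keep only subgraphs the CCA can execute: every opcode must
          # be supported and the subgraph no deeper than the array.
          # (Shape limits here are assumptions; the real test matches
          # each candidate against the CCA's patterns.)
          return [sg for sg in subgraphs
                  if all(op_of[n] in SUPPORTED_OPS for n in sg)
                  and depth_of(sg) <= max_depth]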

  14. Step 3: Grouping • Iteratively group disconnected subgraphs so they execute together in one CCA invocation. • Assuming A and C are the only possibilities for grouping. • [Figure: two dataflow graphs with candidate subgraphs A-F; A and C are merged into the combined candidate AC]
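
  One grouping round might look like the following sketch (Python; conflict() is a hypothetical placeholder for whatever resource checks the real pass applies):

      from itertools import combinations

      def group_disconnected(subgraphs, conflict):
          # One pairwise grouping round: merge node-disjoint subgraphs
          # into a single candidate so both halves run in one CCA
          # invocation; repeat the round to build larger groups.
          # conflict(a, b) is a hypothetical resource-conflict test
          # (e.g., too many combined live-ins or live-outs).
          groups = [frozenset(s) for s in subgraphs]
          for a, b in combinations(list(groups), 2):
              if not (a & b) and not conflict(a, b):
                  groups.append(a | b)
          return groups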

  15. Dealing with Non-uniform Latency • Latency now depends on operand width, so the compiler estimates each candidate's average latency from profiled widths. • >94% of operations do not change width between profiling and execution. • [Figure: execution timeline for operations A (ADD), B (OR), and C (AND) with mixes of 8-bit and 24-bit operands; each has an average latency of 2]
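
  A sketch of such a profile-driven latency estimate (Python; the 8-bit slice width matches the narrow CCA above, the rest is illustrative):

      import math

      def expected_latency(width_profile, slice_bits=8):
          # Average iterations through the narrow datapath, given a
          # profile {operand_width_in_bits: fraction_of_executions}; a
          # w-bit operation needs ceil(w / slice_bits) iterations.
          return sum(frac * math.ceil(w / slice_bits)
                     for w, frac in width_profile.items())

      # An op that is 8-bit 75% of the time and 24-bit otherwise:
      print(expected_latency({8: 0.75, 24: 0.25}))   # 1.5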

  16. Step 4: Unate Covering • Select candidates so that every operation is covered at minimum cost. • [Table: covering matrix with one row per op ID (widths 8, 24, or 32 bits) and one column per candidate (AC, D, G, H, ..., N); a 1 marks that the candidate covers the op. In the example, AC has cost 3 and benefit 1, D has cost 1 and benefit 1, and the remaining candidates have cost 1 and benefit 0.]
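
  To make the matrix concrete, here is a greedy sketch of the selection loop (Python; candidate names and weights mirror the example above, and this heuristic is only a stand-in for a real unate-covering solver):

      def greedy_unate_cover(ops, covers, cost, benefit):
          # Greedy stand-in for the covering step: repeatedly take the
          # candidate with the best benefit-to-cost ratio among those
          # still covering uncovered ops. covers maps candidate -> set
          # of op IDs; an exact unate-covering solver could replace it.
          uncovered, chosen = set(ops), []
          while uncovered:
              useful = [c for c in covers if covers[c] & uncovered]
              if not useful:
                  break              # leftover ops stay on the core
              best = max(useful, key=lambda c: benefit[c] / cost[c])
              chosen.append(best)
              uncovered -= covers[best]
          return chosen

      print(greedy_unate_cover(
          {1, 2, 3, 4}, {"AC": {1, 2, 3}, "D": {1, 2}, "G": {4}},
          cost={"AC": 3, "D": 1, "G": 1},
          benefit={"AC": 1, "D": 1, "G": 1}))    # ['D', 'G', 'AC']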

  17. Experimental Evaluation • ARM port of the Trimaran compiler system • Processor model: ARM-926EJS, single-issue, in-order execution, 5-stage pipeline • I/D caches: 16k, 64-way • Hardware simulation: SimpleScalar 4.0

  18. Comparison of Different CCAs • The 16-bit and 8-bit CCAs are 7% and 9% better than the 32-bit CCA. • Assuming a clock speed of 1/(3.3ns) ≈ 300 MHz.

  19. Comparison of Different Algorithms • Previous work (greedy) is 10% worse than the data-unaware algorithm.

  20. Conclusion • Programmable hardware accelerator • Width-aware CCA optimizes for the common case: 64% faster clock, 4.2x smaller. • Data-centric compilation deals with the non-uniform latency of the CCA: on average 6.5%, and at most 12%, better than the data-unaware algorithm.

  21. ? For more information: http://cccp.eecs.umich.edu/

  22. Data-Centric FEU

  23. Operation of Narrow CCA • Worked example: [(0x1D + 0x0C) + (0x20 OR 0x08)] • [Figure: cycle-by-cycle execution on two 8-bit FUs; the first row computes 0x1D + 0x0C and 0x20 OR 0x08, and the second row adds the two partial results]
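
  A quick Python check of the slide's arithmetic:

      # Verify the slide's example step by step (values in hex).
      t1 = 0x1D + 0x0C        # first-row ADD -> 0x29
      t2 = 0x20 | 0x08        # first-row OR  -> 0x28
      out = t1 + t2           # second-row ADD -> 0x51
      assert (t1, t2, out) == (0x29, 0x28, 0x51)
      # Every operand fits in 8 bits, so the narrow CCA finishes in a
      # single iteration with no carry into the upper slices.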

  24. Data-Centric Subgraph Mapping • Enumeration: all subgraphs • Pruning: subgraph isomorphism • Grouping: iteratively group disconnected subgraphs • Selection: unate covering • Shrink the search space at each phase to control runtime • [Figure: Enumeration → Pruning → Grouping → Selection pipeline]
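
  For completeness, the four phases compose roughly as below (a sketch reusing the hypothetical helpers from the earlier slides; cost_of and benefit_of would come from the width-profile latency model):

      def map_subgraphs(nodes, edges, op_of, depth_of,
                        cost_of, benefit_of, conflict):
          # enumerate -> prune -> group -> select, reusing the earlier
          # sketches; every helper name here is an assumption.
          cands = enumerate_subgraphs(nodes, edges)
          cands = prune_infeasible(cands, op_of, depth_of)
          cands = group_disconnected(cands, conflict)
          covers = {i: set(sg) for i, sg in enumerate(cands)}
          cost = {i: cost_of(covers[i]) for i in covers}
          benefit = {i: benefit_of(covers[i]) for i in covers}
          picked = greedy_unate_cover(set(nodes), covers, cost, benefit)
          return [covers[i] for i in picked]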

  25. How Good is the Cost Function? • Almost all operands stay in the same width range throughout execution.

