
Scalable Subgraph Mapping for Acyclic Computation Accelerators

Nate Clark, Amir Hormati, Scott Mahlke, Sami Yehia. University of Michigan; ARM, Ltd.


Presentation Transcript


  1. Scalable Subgraph Mapping for Acyclic Computation Accelerators. Nate Clark, Amir Hormati, Scott Mahlke, Sami Yehia. University of Michigan; ARM, Ltd.

  2. ASIP Architecture • Tightly integrated, atomic execution • Examples: MAC, dot-product, Galois field [Figure: processor pipeline with Fetch, Issue, and WB stages; an accelerator (Accel.) sits alongside the ALUs]

  3. What Are We Solving? [Figure: function units labeled *, +/-, and +/-]

  4. How Do People Solve This Problem? • Hand coding • Greedy algorithms

  5. Greedy Algorithms • Locally optimal decisions • Example: partition problem (divide numbers into two sets with near-equal sums) • Input: 4, 3, 3, 5, 3; sorted: 5, 4, 3, 3, 3 • Greedy result: Set 1 = 5, 3 (sum: 8); Set 2 = 4, 3, 3 (sum: 10)
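
The partition example on this slide can be sketched in a few lines. This is a minimal illustration, assuming the greedy rule is "take the largest remaining number and put it in the set with the smaller sum", which reproduces the slide's result; the function name `greedy_partition` and the tie-breaking rule are assumptions, not part of the presentation.

```python
def greedy_partition(numbers):
    """Greedy two-way partition: largest number first, into the lighter set."""
    set1, set2 = [], []
    for n in sorted(numbers, reverse=True):
        # Locally optimal choice: add to whichever set has the smaller sum.
        # Ties go to set1 (an arbitrary choice for this sketch).
        if sum(set1) <= sum(set2):
            set1.append(n)
        else:
            set2.append(n)
    return set1, set2


set1, set2 = greedy_partition([4, 3, 3, 5, 3])
print(set1, sum(set1))  # [5, 3] 8
print(set2, sum(set2))  # [4, 3, 3] 10
```

Each decision looks only at the current sums, which is why the result ends at 8 versus 10 rather than an even split.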

  6. Greedy Pros / Cons • Cons: room for improvement • Pros: fast, easy to implement • Greedy: 5, 3 (sum: 8) and 4, 3, 3 (sum: 10) • Optimal: 5, 4 (sum: 9) and 3, 3, 3 (sum: 9)
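
For comparison, the optimal split on this slide can be found by trying every subset. The sketch below is only an illustration of exhaustive search on the slide's five numbers; the name `optimal_partition` is hypothetical, and the approach is exponential in the input size, which is the cost the greedy method avoids.

```python
from itertools import combinations


def optimal_partition(numbers):
    """Exhaustive two-way partition minimizing the difference of the sums."""
    total = sum(numbers)
    indices = range(len(numbers))
    best = None
    for size in range(len(numbers) + 1):
        for chosen in combinations(indices, size):
            set1 = [numbers[i] for i in chosen]
            diff = abs(total - 2 * sum(set1))
            # Keep the first partition with the smallest difference seen so far.
            if best is None or diff < best[0]:
                set2 = [numbers[i] for i in indices if i not in chosen]
                best = (diff, set1, set2)
    return best[1], best[2]


set1, set2 = optimal_partition([5, 4, 3, 3, 3])
print(set1, sum(set1))  # [5, 4] 9
print(set2, sum(set2))  # [3, 3, 3] 9
```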

  7. Compilation Problems • Frequently NP-complete: scheduling, superblock selection, allocation • Greedy algorithms prevalent

  8. Compilation for Acyclic Accelerators • Define target • Describe greedy algorithm • Develop FEU algorithm • Compare runtime, quality

  9. Target Accelerator • Array of FUs • Arithmetic/logic • Sparse interconnect • 82% important subgraphs [Figure: accelerator with inputs Input1 to Input4 and outputs Output1 and Output2]

  10. Subgraph Mapping • Select parts of applications to accelerate [Figure: dataflow graph with three live-ins, two live-outs, and 17 ops: 1 SHR, 2 SHL, 3 AND, 4 SHR, 5 AND, 6 SUB, 7 MPY, 8 SHL, 9 ADD, 10 SHR, 11 ADD, 12 SHL, 13 SHL, 14 SHR, 15 SHR, 16 CMP, 17 BEQ]

  11. Subgraph Mapping: 3 Steps • Enumerate: find candidates • Prune: remove invalid candidates • Selection: pick candidates for the accelerator [Figure: the 17-op dataflow graph from slide 10]

  12. Greedy Subgraph Mapping • Speedup = 17/7 ≈ 2.4 [Figure: the 17-op dataflow graph from slide 10 with the greedy mapping applied]

  13. Greedy Summary • Enumeration: restricted • Prune: unnecessary • Selection: implicit [Figure: the 17-op dataflow graph from slide 10]

  14. Full Enumeration-Unate Covering (FEU) • Enumerate: all subgraphs • Prune: subgraph isomorphism • Selection: unate covering • Shrink search space to control runtime
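
The enumeration step listed on this slide can be illustrated on a toy graph. The sketch below is not the authors' implementation: it enumerates every connected node set of a small hypothetical 4-op DAG, and uses a `max_size` limit as a stand-in for the slide's "shrink search space" idea. The function name, the example edges, and the size limit are all assumptions, and real constraints such as input/output port counts are not checked.

```python
def enumerate_connected_subgraphs(edges, num_nodes, max_size):
    """Return every connected node set with at most max_size nodes."""
    neighbors = {n: set() for n in range(num_nodes)}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # connectivity ignores edge direction

    found = set()
    frontier = [frozenset([n]) for n in range(num_nodes)]
    while frontier:
        subgraph = frontier.pop()
        if subgraph in found:
            continue  # already reached through another growth order
        found.add(subgraph)
        if len(subgraph) == max_size:
            continue  # size limit keeps the search space bounded
        # Grow the subgraph by one adjacent node at a time.
        for node in subgraph:
            for nb in neighbors[node] - subgraph:
                frontier.append(subgraph | {nb})
    return found


# Hypothetical DAG: ops 0 and 1 feed op 2, which feeds op 3.
edges = [(0, 2), (1, 2), (2, 3)]
candidates = enumerate_connected_subgraphs(edges, num_nodes=4, max_size=4)
print(len(candidates))  # 11
```

Even this 4-op graph yields 11 candidates, which hints at why full enumeration needs pruning and a bounded search space to stay fast.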

  15. Full Enumeration [Figure: example subgraphs enumerated from the 17-op dataflow graph on slide 10]

  16. Subgraph Isomorphism Pruning • Ensure subgraphs can run on accelerator [Figure: a candidate subgraph (ops 3, 6, 8, 10, 11) matched against accelerator FUs labeled A to H, which include shift (<<, >>), multiply (*), logic, and add/subtract (+/-) units]

  17. Unate Covering Selection • Place as many ops in as few subgraphs as possible • Speedup = 17/5 = 3.4 [Figure: covering matrix of ops versus subgraphs, next to the 17-op dataflow graph from slide 10]
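
The selection goal on this slide, covering every op with as few subgraphs as possible, can be sketched as a brute-force minimum cover. This is an illustration under stated assumptions, not the presentation's solver: the candidate sets are made up, each op is also allowed to run alone as a singleton column, and overlapping candidates are not rejected. The name `min_cover` is hypothetical.

```python
from itertools import combinations


def min_cover(ops, candidates):
    """Smallest group of candidate subgraphs whose union covers all ops."""
    ops = set(ops)
    # Assumption: any op may also execute by itself, so add singleton columns.
    columns = list(candidates) + [frozenset([op]) for op in sorted(ops)]
    for k in range(1, len(columns) + 1):
        for chosen in combinations(columns, k):
            if set().union(*chosen) == ops:
                return list(chosen)
    return []


# Hypothetical candidates over six ops, numbered 0 to 5.
candidates = [
    frozenset([0, 1, 2]),
    frozenset([2, 3]),
    frozenset([3, 4, 5]),
    frozenset([1, 2, 3]),
]
cover = min_cover(range(6), candidates)
print(len(cover))      # 2
print(6 / len(cover))  # 3.0
```

The last line divides the op count by the number of selected units, in the same style as the slide's 17/5 ratio; reading the slide's speedup that way is an assumption of this sketch.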

  18. FEU Runtime [Figure: runtime chart with labeled values 99.5% and 98%]

  19. Mapping Algorithm Performance

  20. Conclusions • Greedy algorithms: opportunity! • FEU subgraph mapping • Better: 50% more speedup • Fast: >98% of blocks in less than 1 second

  21. Questions?
