Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip Mohamed ABDELFATTAH Vaughn BETZ
Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA
Outline 1 Why NoCs on FPGAs? Motivation Previous Work 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA
1. Why NoCs on FPGAs? Motivation Logic Blocks Switch Blocks Wires Interconnect
1. Why NoCs on FPGAs? Motivation Logic Blocks Switch Blocks • Hard Blocks: • Memory • Multiplier • Processor Wires
1. Why NoCs on FPGAs? Motivation 1600 MHz Hard Interfaces DDR/PCIe .. Logic Blocks 800 MHz Switch Blocks Interconnect still the same • Hard Blocks: • Memory • Multiplier • Processor Wires 200 MHz
1. Why NoCs on FPGAs? Motivation 1600 MHz • Bandwidth requirements for hard logic/interfaces • Timing closure DDR3 PHY and Controller PCIe Controller 800 MHz 200 MHz Gigabit Ethernet
1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization DDR3 PHY and Controller PCIe Controller Gigabit Ethernet
1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated DDR3 PHY and Controller PCIe Controller Gigabit Ethernet
1. Why NoCs on FPGAs? Motivation • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller PCIe Controller Gigabit Ethernet
Source: Google Earth Los Angeles Barcelona Keep the “roads”, but add “freeways”. Logic Cluster Hard Blocks
1. Why NoCs on FPGAs? FPGA with NoC NoC • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller Router forwards data packet PCIe Controller Links Router moves data to local interconnect Routers Gigabit Ethernet
1. Why NoCs on FPGAs? FPGA with NoC • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller PCIe Controller • High bandwidth endpoints known • Pre-design NoC to requirements Gigabit Ethernet • NoC links are “re-usable” • Latency-tolerant communication • NoC abstraction favors modularity
1. Why NoCs on FPGAs? FPGA with NoC • Bandwidth requirements for hard logic/interfaces • Timing closure • High interconnect utilization: • Huge CAD Problem • Slow compilation • Power/area utilization • Wire speed not scaling: • Delay is interconnect-dominated • Low-level interconnect hinders modularity: • Parallel compilation • Partial reconfiguration • Multi-chip interconnect DDR3 PHY and Controller PCIe Controller Gigabit Ethernet • Latency-tolerant communication • NoC abstraction favors modularity
1. Why NoCs on FPGAs? Hard vs. Soft • Implementation options: • Soft Logic (LUTs, .. ) • Hard Logic (unchangeable) • Mixed Soft/Hard DDR3 PHY and Controller PCIe Controller Soft NoC Hard NoC Efficiency Configurability Gigabit Ethernet Investigate the hard vs. soft tradeoff for NoCs (area/delay)
1. Why NoCs on FPGAs? Previous Work
• FPGA-tuned Soft NoCs:
  • LiPar (2005), NoCeM (2008), Connect (2012)
• Hard NoCs:
  • Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs
• Applications that leverage NoCs:
  • Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing
• Our Contributions:
  • Quantify area/performance gap of hard and soft NoCs
  • Investigate how this impacts NoC design (hard/soft)
  • Integrate hard NoC with FPGA fabric
Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap Results NoC Architecture Methodology Soft NoC design Area/Speed Efficiency Gap 3 Integrating hard NoCs with FPGA
2. Hard/Soft Efficiency Router Microarchitecture
• NoC = Routers + Links
• State-of-the-art router architecture from Stanford:
  • The NoC community has excelled at building routers, so we use an existing design
  • To meet FPGA bandwidth requirements: high-performance router
  • A complex router includes a superset of the NoC components that may be used: more complete analysis
• Split router into 5 components
2. Hard/Soft Efficiency Router – 5 Components
2. Hard/Soft Efficiency Router – 5 Components Input Modules: Multi-Queue Buffer = Memory + Control Logic • Port Width • Buffer depth • Number of VCs
2. Hard/Soft Efficiency Router – 5 Components Multiplexers Logic + crowded interconnect • Port Width • Number of Ports Crossbar
2. Hard/Soft Efficiency Router – 5 Components Retiming Register Registers + little control logic • Port Width • Number of VCs Output Modules
2. Hard/Soft Efficiency Router – 5 Components Allocators Arbiters = Logic + Registers • Number of Ports • Number of VCs
2. Hard/Soft Efficiency Design Space 4 Parameters 5 Components Port Width Input Module Crossbar Number of Ports Number of VCs Output Module Buffer Depth VC Allocator SW Allocator
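The design space above can be written down as a small lookup table. The sketch below is only an illustration: the parameter lists come from the component slides (both allocators are assumed to share the "Number of Ports" and "Number of VCs" dependencies listed for "Allocators"), while the dictionary layout and the helper name `components_affected_by` are hypothetical and not part of the authors' tooling.

```python
# Which of the 4 design parameters each of the 5 router components depends on,
# as listed on the component slides. The data layout is illustrative only.
COMPONENT_PARAMS = {
    "input_module": {"port_width", "buffer_depth", "num_vcs"},
    "crossbar": {"port_width", "num_ports"},
    "output_module": {"port_width", "num_vcs"},
    # Assumption: both allocators share the parameters listed for "Allocators".
    "vc_allocator": {"num_ports", "num_vcs"},
    "sw_allocator": {"num_ports", "num_vcs"},
}

# The full parameter set is the union over all components.
ALL_PARAMS = sorted(set().union(*COMPONENT_PARAMS.values()))


def components_affected_by(param):
    """Return the components whose area/delay must be re-measured when param changes."""
    return sorted(name for name, params in COMPONENT_PARAMS.items() if param in params)


if __name__ == "__main__":
    for p in ALL_PARAMS:
        print(p, "->", components_affected_by(p))
```

Read this way, a buffer-depth sweep only touches the input module, whereas a port-count sweep touches the crossbar and both allocators.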
2. Hard/Soft Efficiency Methodology • Post-routing FPGA (soft) area and delay • Post-synthesis ASIC (hard) area and delay • Both TSMC 65 nm technology (Stratix III) • Verify results against previous FPGA:ASIC comparison by Kuon and Rose Per Router Component
2. Hard/Soft Efficiency 3 Options for Buffer on FPGA
• Relatively small memories
• Critical component in router design
• 3 options for FPGA:
  • Registers: one per LUT
  • LUTRAM: 640 bits
  • Block RAM: 9 Kbits
• Area of each implementation option
2. Hard/Soft Efficiency 3 Options for Buffer on FPGA Width = 32 Bits Another logic cluster used
2. Hard/Soft Efficiency 3 Options for Buffer on FPGA
• Relatively small memories
• 3 options for implementation on FPGA:
  • Registers: one per LUT, 0.77 Kbit/mm²
  • LUTRAM: 640 bits, 23 Kbit/mm²
  • Block RAM: 9 Kbits, 142 Kbit/mm²
• 16% utilized BRAM more area efficient than fully used LUTRAM (valid for Stratix III)
• LUTRAM could win for some points in other FPGAs
Use BRAM for FPGA (soft) implementation
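The 16% figure follows from the densities quoted on the slide. A partly filled memory stores fewer useful bits in the same silicon area, so its effective density scales with utilization; the break-even point is the ratio of the two densities. The sketch below works through that arithmetic. The function name `break_even_utilization` is a hypothetical helper, and the linear scaling of effective density with utilization is a simplifying assumption.

```python
# Storage densities quoted on the slide (Stratix III, Kbit per mm^2).
DENSITY_KBIT_PER_MM2 = {
    "registers": 0.77,
    "lutram": 23.0,
    "bram": 142.0,
}


def break_even_utilization(dense, sparse):
    """Fraction of the denser memory that must be used for it to match
    the sparser memory when the sparser one is fully used.

    Assumes effective density = utilization * peak density.
    """
    return DENSITY_KBIT_PER_MM2[sparse] / DENSITY_KBIT_PER_MM2[dense]


if __name__ == "__main__":
    u = break_even_utilization("bram", "lutram")
    print(f"BRAM beats a full LUTRAM above {u:.1%} utilization")  # 16.2%
```

Any buffer that fills more than about 16% of a block RAM is therefore cheaper in BRAM than in LUTRAM on this device, which is the basis for the "use BRAM" recommendation.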
2. Hard/Soft Efficiency Results – High Port Count 60X – 170X 24X – 94X High port count inefficient in soft Soft
2. Hard/Soft Efficiency Results – Width 72X 26X – 17X High port count inefficient in soft Width scales better Soft
2. Hard/Soft Efficiency Results – Deep Buffers Filling up the BRAM Buffer depth is free on FPGAs when using BRAM Soft
2. Hard/Soft Efficiency Soft Router Design • Design recommendations based on FPGA silicon area • Supported by delay measurements Buffer depth is free on FPGAs when using BRAM High port count inefficient in soft Width scales better Use BRAM for FPGA (soft) implementation Soft Soft Soft
2. Hard/Soft Efficiency Results – Area Memory = Logic + Registers
2. Hard/Soft Efficiency Results – Delay
Outline 1 Why NoCs on FPGAs? 2 Hard/soft efficiency gap 3 Integrating hard NoCs with FPGA Hard NoC + FPGA Wiring Conclusion Future Work
3. Hard NoC with FPGA What to harden? 40% 50% Total Area Critical Path 10% Results suggest hardening Crossbar and Allocators Mixed hard/soft implementation
3. Hard NoC with FPGA Mixed Implementation • For a typical router .. • 5 ports • 32 bits wide • 2 VCs • 10 buffer words Mixed not worth hardening How to connect hard and soft? How efficient is mixed/hard after doing that? ? ? Hard Soft
3. Hard NoC with FPGA Integrating a Hard Router Logic clusters FPGA Router Router Logic • Same I/O mux structure as a logic block – 9X the area • Conventional FPGA interconnect between routers
3. Hard NoC with FPGA Integrating a Hard Router FPGA 730 MHz Router • Same I/O mux structure as a logic block – 9X the area • Conventional FPGA interconnect between routers
3. Hard NoC with FPGA Integrating a Hard Router FPGA Router Assumed a mesh Can form any topology
3. Hard NoC with FPGA Integrating a Hard Router 64-node NoC on Stratix V Router Provides 47 GB/s peak bisection bandwidth Very Cheap! Less than cost of 3 soft nodes Hard NoC + Soft Interconnect is very compelling
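The slide does not show how the 47 GB/s figure was obtained. One reading that reproduces it, sketched below, assumes the 64 nodes form an 8×8 mesh (the topology assumed on the previous slide), 32-bit links (the "typical router" width) clocked at 730 MHz, and counts the links crossing the bisection in both directions. The function name is a hypothetical helper.

```python
import math


def mesh_bisection_bandwidth_gbytes(nodes, link_bits, freq_mhz):
    """Peak bisection bandwidth of a square 2D mesh in GB/s.

    Assumptions (not stated on the slide): square mesh, one link per
    direction between neighbouring routers, one word per link per cycle.
    """
    side = math.isqrt(nodes)
    if side * side != nodes:
        raise ValueError("expected a square mesh")
    # Cutting the mesh in half severs `side` links in each direction.
    links_cut = 2 * side
    bits_per_second = links_cut * link_bits * freq_mhz * 1e6
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    bw = mesh_bisection_bandwidth_gbytes(nodes=64, link_bits=32, freq_mhz=730)
    print(f"{bw:.2f} GB/s")  # 46.72 GB/s, which rounds to the quoted 47 GB/s
```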
1 Why NoCs on FPGAs?
• Big city needs freeways to handle traffic
• Solve communication problems for a large/heterogeneous FPGA:
  • Timing Closure – Interconnect Scaling – Modular Design
2 Hard/soft efficiency gap
• A hard NoC is on average 30X smaller and 3.6X faster than soft
• Crossbars and allocators worst – Input buffer best
• An efficient soft NoC:
  • Uses BRAMs – Large width, low Port Count – Deep buffers
3 Integrating hard NoCs with FPGA
• Mixed implementation does not make sense
• Integrated fully hard NoC with FPGA fabric (for NoC Links)
  • 22X area improvement over soft
  • Reaches max. FPGA frequency (4.7X faster than soft)
• 64-node NoC = 0.6% of total FPGA area (Stratix V)
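The 22X area figure is consistent with the earlier claim that the 64-node hard NoC costs less than 3 soft nodes. The slides do not state how that claim was derived; the sketch below assumes the 22X ratio applies per router, including its interface to the FPGA fabric, and the helper name is hypothetical.

```python
def soft_router_equivalents(num_hard_routers, area_ratio):
    """Area of a set of hard routers, expressed in units of one soft router.

    Assumes every hard router is `area_ratio` times smaller than its soft
    counterpart (22X in the summary above).
    """
    return num_hard_routers / area_ratio


if __name__ == "__main__":
    equiv = soft_router_equivalents(num_hard_routers=64, area_ratio=22.0)
    print(f"64 hard routers ~ {equiv:.2f} soft routers")  # 2.91
```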
3. Hard NoC with FPGA Future Work • Power analysis • More hardening: • Dedicated inter-router links (hard wires) • Clock domain crossing hardware • How do traffic hotspots (DDR/PCIe) influence NoC design? • Latency insensitive design methodology that uses NoC • CAD tool changes for a NoC-based FPGA
Thank You! mohamed@eecg.utoronto.ca