
Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees

Presentation Transcript


  1. Kilo-NOC: A Network-on-Chip Architecture for Scalability and Service Guarantees Boris Grot The University of Texas at Austin

  2. Technology Trends [figure: transistor count vs. year of introduction for Intel processors, from the 4004 and 8086 through the 286, 386, 486, Pentium, Pentium 4, Pentium D, Core i7, and Xeon Nehalem-EX]

  3. Technology Applications

  4. Networks-on-Chip (NOCs) • The backbone of highly integrated chips • Transport of memory, operand, and control traffic • Structured, packet-based, multi-hop networks • Increasing importance with greater levels of integration • Major impact on chip performance, energy, and area • TRIPS: 28% performance loss on SPEC 2K in NOC • Intel Polaris: 28% of chip power consumption in NOC • “Moving data is more expensive [energy-wise] than operating on it” – William Dally, SC ‘10

  5. On-chip vs Off-chip Interconnects • Topology • Routing • Flow control • Pins • Bandwidth • Power • Area

  6. Future NOC Requirements • 100’s to 1000’s of network clients • Cores, caches, accelerators, I/O ports, … • Efficient topologies • High performance, small footprint • Intelligent routing • Performance through better load balance • Light-weight flow control • High performance, low buffer requirements • Service Guarantees • cloud computing, real-time apps demand QOS support (related publications: HPCA ‘08, HPCA ‘09, MICRO ‘09, and work under submission)

  7. Outline • Introduction • Service Guarantees in Networks-on-Chip • Motivation • Desiderata, prior work • Preemptive Virtual Clock • Evaluation highlights • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work

  8. Why On-chip Quality-of-Service? • Shared on-chip resources • Memory controllers, accelerators, network-on-chip • … require QOS support • fairness, service differentiation, performance isolation • End-point QOS solutions are insufficient • Data has to traverse the on-chip network • Need QOS support at the interconnect level • Hard guarantees in NOCs

  9. NOC QOS Desiderata • Fairness • Isolation of flows • Bandwidth efficiency • Low overhead: • delay • area • energy

  10. Conventional QOS Disciplines • Fixed schedule • Pros: algorithmic and implementation simplicity • Cons: inefficient BW utilization; per-flow queuing • Example: Round Robin • Rate-based • Pros: fine-grained scheduling; BW efficient • Cons: complex scheduling; per-flow queuing • Example: Weighted Fair Queuing (WFQ) [SIGCOMM ‘89] • Frame-based • Pros: good throughput at modest complexity • Cons: throughput-complexity trade-off; per-flow queuing • Example: Rotating Combined Queuing (RCQ) [ISCA ’96] • Per-flow queuing • Area overhead • Energy overhead • Delay overhead • Scheduling complexity
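
To make the rate-based idea concrete, below is a minimal sketch of Weighted Fair Queuing's virtual finish-time computation (the scheduler always serves the packet with the smallest finish time). The flow weights and packet lengths are illustrative values, not figures from the talk.

```c
/* Minimal sketch of the WFQ idea cited above: each arriving packet gets a
 * virtual finish time F = max(virtual_time, last_finish_of_flow) + length/weight,
 * and the scheduler serves the packet with the smallest F. */
#include <stdio.h>

#define NFLOWS 2

int main(void) {
    double weight[NFLOWS] = {2.0, 1.0};   /* flow 0 provisioned 2x flow 1 (illustrative) */
    double last_finish[NFLOWS] = {0.0, 0.0};
    double virtual_time = 0.0;            /* simplified: not advanced in this sketch */

    /* arrivals: (flow id, packet length in flits) */
    int flow[] = {0, 1, 0, 1};
    double len[] = {4, 4, 4, 4};

    for (int i = 0; i < 4; i++) {
        int f = flow[i];
        double start = last_finish[f] > virtual_time ? last_finish[f] : virtual_time;
        double finish = start + len[i] / weight[f];
        last_finish[f] = finish;
        printf("pkt %d (flow %d): virtual finish time %.1f\n", i, f, finish);
    }
    /* Flow 0's packets finish earlier in virtual time, so it is served
     * proportionally more often -- but computing this exactly requires
     * per-flow queues, which is the cost called out on the slide. */
    return 0;
}
```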

  11. Preemptive Virtual Clock (PVC) [HPCA ‘09] • Goal: high-performance, cost-effective mechanism for fairness and service differentiation in NOCs. • Full QOS support • Fairness, prioritization, performance isolation • Modest area and energy overhead • Minimal buffering in routers & source nodes • High Performance • Low latency, good BW efficiency

  12. PVC: Scheduling • Combines rate-based and frame-based features • Rate-based: evolved from Virtual Clock [SIGCOMM ’90] • Routers track each flow’s bandwidth consumption • Cheap priority computation • f (provisioned rate, consumed BW) • Problem: history effect

  13. PVC: Scheduling • Combines rate-based and frame-based features • Rate-based: evolved from Virtual Clock [SIGCOMM ’90] • Routers track each flow’s bandwidth consumption • Cheap priority computation • f (provisioned rate, consumed BW) • Problem: history effect • Framing: PVC’s solution to history effect • Frame rollover clears all BW counters • Fixed frame duration

  14. PVC: Scheduling • Combines rate-based and frame-based features • Rate-based: evolved from Virtual Clock [SIGCOMM ’90] • Routers track each flow’s bandwidth consumption • Cheap priority computation • f (provisioned rate, consumed BW) • Problem: history effect • Frame rollover: BW counters reset, priorities reset
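
The mechanism described on the last three slides can be sketched in a few lines of C. This is an illustration of the idea only, not the actual PVC hardware: the priority function and counter widths are assumptions.

```c
/* Sketch of PVC-style scheduling: each router keeps a per-flow
 * consumed-bandwidth counter, priority is a function of the provisioned
 * rate and consumed bandwidth (less consumption relative to the
 * provisioned rate => higher priority), and a frame rollover clears every
 * counter so that old history cannot penalize a flow forever. */
#include <stdio.h>
#include <string.h>

#define NFLOWS 4
#define FRAME_CYCLES 1000u                  /* fixed frame duration (illustrative) */

static unsigned consumed[NFLOWS];           /* flits forwarded in the current frame */
static const unsigned rate[NFLOWS] = {4, 2, 1, 1};  /* provisioned shares (illustrative) */

/* Higher value = higher priority: flows that have used little of their
 * provisioned share win arbitration. */
static int priority(int f) {
    return -(int)(consumed[f] / rate[f]);
}

static void on_flit_forwarded(int f) { consumed[f]++; }

static void on_frame_rollover(void) {
    memset(consumed, 0, sizeof consumed);   /* history effect is erased */
}

int main(void) {
    for (unsigned cycle = 1; cycle <= 2 * FRAME_CYCLES; cycle++) {
        on_flit_forwarded(0);               /* flow 0 hogs the link */
        if (cycle % FRAME_CYCLES == 0)
            on_frame_rollover();
    }
    printf("after rollover, flow 0 priority = %d (back to parity)\n", priority(0));
    return 0;
}
```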

  15. PVC: Freedom from Priority Inversion • PVC: simple routers w/o per-flow buffering and no BW reservation • Problem: high priority packets may be blocked by lower priority packets (priority inversion)

  16. PVC: Freedom from Priority Inversion • PVC: simple routers w/o per-flow buffering and no BW reservation • Problem: high priority packets may be blocked by lower priority packets (priority inversion) • Solution: preemption of lower priority packets

  17. PVC: Preemption Recovery • Retransmission of dropped packets • Buffer outstanding packets at the source node • ACK/NACK protocol via a dedicated network • All packets acknowledged • Narrow, low-complexity network • Lower overhead than timeout-based recovery • 64 node network: 30-flit backup buffer per node suffices
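
A minimal sketch of the recovery protocol described above, assuming packet-granularity backup entries and simple ACK/NACK messages; the sizes and message formats are illustrative.

```c
/* Sketch of preemption recovery: the source keeps every outstanding packet
 * in a small backup buffer, frees an entry when an ACK arrives on the
 * dedicated control network, and retransmits the stored copy when a NACK
 * reports that the packet was preempted (dropped) inside the network. */
#include <stdio.h>

#define BACKUP_ENTRIES 30   /* per-node backup buffer (illustrative granularity) */

struct backup {
    int valid;
    int pkt_id;
};

static struct backup buf[BACKUP_ENTRIES];

static int send_packet(int pkt_id) {
    for (int i = 0; i < BACKUP_ENTRIES; i++) {
        if (!buf[i].valid) {
            buf[i].valid = 1;
            buf[i].pkt_id = pkt_id;   /* keep a copy until the packet is ACKed */
            return 1;                 /* injected into the data network */
        }
    }
    return 0;                         /* buffer full: stall injection */
}

static void on_ack(int pkt_id) {      /* delivered: release the stored copy */
    for (int i = 0; i < BACKUP_ENTRIES; i++)
        if (buf[i].valid && buf[i].pkt_id == pkt_id)
            buf[i].valid = 0;
}

static void on_nack(int pkt_id) {     /* preempted: resend the stored copy */
    printf("packet %d was preempted, retransmitting\n", pkt_id);
}

int main(void) {
    send_packet(7);
    on_nack(7);     /* a downstream router dropped it for a higher-priority flow */
    on_ack(7);      /* second attempt delivered */
    return 0;
}
```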

  18. PVC: Preemption Throttling • Relaxed definition of priority inversion • Reduces preemption frequency • Small fairness penalty • Per-flow bandwidth reservation • Flits within the reserved quota are non-preemptible • Reserved quota is a function of rate and frame size • Coarsened priority classes • Mask out lower-order bits of each flow’s BW counter • Induces coarser priority classes • Enables a fairness/throughput trade-off
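
A sketch of the two throttling knobs above; the quota formula and the mask width are illustrative assumptions rather than the exact PVC parameters.

```c
/* 1) Per-flow reserved quota: flits sent within the quota (a function of
 *    the flow's provisioned rate and the frame size) cannot be preempted.
 * 2) Coarsened priority classes: masking the low-order bits of the
 *    bandwidth counters makes flows with similar consumption compare
 *    equal, so fewer inversions are seen and fewer preemptions occur. */
#include <stdio.h>

#define FRAME_FLITS 1024u

/* Quota for a flow provisioned `rate` out of `total_rate` across all flows. */
static unsigned reserved_quota(unsigned rate, unsigned total_rate) {
    return FRAME_FLITS * rate / total_rate;
}

static int is_preemptible(unsigned flits_sent_this_frame, unsigned quota) {
    return flits_sent_this_frame >= quota;   /* within quota => protected */
}

/* Mask out low-order bits: counters 0..15 all map to class 0, etc. */
static unsigned priority_class(unsigned bw_counter, unsigned mask_bits) {
    return bw_counter >> mask_bits;
}

int main(void) {
    unsigned quota = reserved_quota(2, 8);   /* 2/8 of a 1024-flit frame = 256 flits */
    printf("quota = %u flits; packet at flit 200 preemptible? %d\n",
           quota, is_preemptible(200, quota));
    printf("counters 37 and 42 in the same class (4 bits masked)? %d\n",
           priority_class(37, 4) == priority_class(42, 4));
    return 0;
}
```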

  19. PVC: Guarantees • Minimum Bandwidth • Based on reserved quota • Fairness • Subject to BW counter resolution • Worst-case Latency • Packet enters source buffer in frame N, guaranteed delivery by the end of frame N+1
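
A back-of-the-envelope version of these guarantees, under made-up parameters:

```c
/* A flow's minimum bandwidth follows from its reserved quota per frame,
 * and a packet that enters the source buffer in frame N is delivered by
 * the end of frame N+1, so its delay is bounded by at most two frame
 * durations.  The numbers below are illustrative only. */
#include <stdio.h>

int main(void) {
    unsigned frame_cycles = 50000;   /* frame duration in cycles (illustrative) */
    unsigned quota_flits  = 256;     /* flow's reserved quota per frame (illustrative) */

    double min_bw = (double)quota_flits / frame_cycles;   /* flits per cycle */
    unsigned worst_case_latency = 2 * frame_cycles;       /* end of frame N+1 */

    printf("guaranteed bandwidth  >= %.4f flits/cycle\n", min_bw);
    printf("worst-case delivery   <= %u cycles\n", worst_case_latency);
    return 0;
}
```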

  20. Performance Isolation

  21. Performance Isolation • Baseline NOC • No QOS support • Globally Synchronized Frames (GSF) • J. Lee, et al. ISCA 2008 • Frame-based scheme adapted for on-chip implementation • Source nodes enforce bandwidth quotas via self-throttling • Multiple frames in-flight for performance • Network prioritizes packets based on frame number • Preemptive Virtual Clock (PVC) • Highest fairness setting (unmasked bandwidth counters)

  22. Performance Isolation

  23. PVC Summary • Full QOS support • Fairness & service differentiation • Strong performance isolation • High performance • Inelaborate routers → low latency • Good bandwidth efficiency • Modest area and energy overhead • 3.4 KB of storage per node (1.8x no-QOS router) • 12-20% extra energy per packet

  24. PVC Summary • Full QOS support • Fairness & service differentiation • Strong performance isolation • High performance • Inelaborate routers → low latency • Good bandwidth efficiency • Modest area and energy overhead • 3.4 KB of storage per node (1.8x no-QOS router) • 12-20% extra energy per packet Will it scale to 1000 nodes?

  25. Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Mesh-based networks • Toward low-diameter topologies • Multidrop Express Channels • Kilo-NOC: A Network for 1000+ Nodes • Summary and Future Work

  26. NOC Topologies • Topology is the principal determinant of network performance, cost, and energy efficiency • Topology desiderata • Rich connectivity → reduces router traversals • High bandwidth → reduces latency and contention • Low router complexity → reduces area and delay • On-chip constraints • 2D substrates limit implementable topologies • Logic area/energy constrains use of wire resources • Power constraints restrict routing choices

  27. 2-D Mesh • Pros • Low design & layout complexity • Simple, fast routers

  28. 2-D Mesh • Pros • Low design & layout complexity • Simple, fast routers • Cons • Large diameter • Energy & latency impact

  29. Concentrated Mesh (Balfour & Dally, ICS ‘06) • Pros • Multiple terminals at each node • Fast nearest-neighbor communication via the crossbar • Hop count reduction proportional to concentration degree • Cons • Benefits limited by crossbar complexity
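
A small worked example of the hop-count reduction, assuming a 64-terminal chip (the parameters are illustrative, not from the talk):

```c
/* An 8x8 mesh has diameter 2*(8-1) = 14 router-to-router hops, while
 * 4-way concentration shrinks the router grid to 4x4, with diameter
 * 2*(4-1) = 6 hops -- at the cost of a larger crossbar in each router. */
#include <stdio.h>

static int mesh_diameter(int k) { return 2 * (k - 1); }  /* k x k mesh */

int main(void) {
    printf("8x8 mesh (64 terminals):       diameter %d hops\n", mesh_diameter(8));
    printf("4x4 mesh, 4-way concentration: diameter %d hops\n", mesh_diameter(4));
    return 0;
}
```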

  30. Flattened Butterfly (Kim et al., Micro ‘07) • Objectives: • Improve connectivity • Exploit the wire budget

  31. Flattened Butterfly • Point-to-point links • Nodes fully connected in each dimension

  32. Flattened Butterfly • Pros • Excellent connectivity • Low diameter: 2 hops • Cons • High channel count: k2/2 per row/column • Low channel utilization • Control complexity

  33. Multidrop Express Channels (MECS) [Grot et al., Micro ‘09] • Objectives: • Connectivity • More scalable channel count • Better channel utilization

  34. Multidrop Express Channels (MECS) • Point-to-multipoint channels • Single source • Multiple destinations • Drop points: • Propagate further -OR- • Exit into a router
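
A sketch of how a flit traverses a multidrop channel as just described; the coordinates and the routing check are simplified illustrations.

```c
/* One source drives a channel that runs past the other routers in its row;
 * at each drop point the flit either exits into that router (if it has
 * reached its destination column) or propagates further along the wire. */
#include <stdio.h>

int main(void) {
    int src_col = 0, dst_col = 5, row_width = 8;   /* illustrative coordinates */

    /* The flit rides the eastbound channel sourced by the router in column 0. */
    for (int col = src_col + 1; col < row_width; col++) {
        if (col == dst_col) {
            printf("drop point at column %d: exit into router\n", col);
            break;
        }
        printf("drop point at column %d: propagate further\n", col);
    }
    return 0;
}
```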

  35. Multidrop Express Channels (MECS)

  36. Multidrop Express Channels (MECS) • Pros • One-to-many topology • Low diameter: 2 hops • k channels per row/column • Cons • I/O asymmetry • Control complexity
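
Putting the topology figures side by side for one row/column of k = 8 routers. The flattened-butterfly and MECS channel counts follow the expressions quoted on the slides; the mesh figure and the choice of k are illustrative.

```c
/* The flattened butterfly needs on the order of k^2/2 channels per
 * row/column, MECS on the order of k, and both reach any node in at most
 * 2 network hops (one per dimension); the mesh needs the fewest channels
 * but pays with a large diameter. */
#include <stdio.h>

int main(void) {
    int k = 8;   /* routers per row/column (illustrative) */

    printf("2D mesh:             diameter %2d, ~%d channels per row\n",
           2 * (k - 1), k - 1);
    printf("flattened butterfly: diameter  2, ~%d channels per row\n",
           k * k / 2);
    printf("MECS:                diameter  2, ~%d channels per row\n", k);
    return 0;
}
```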

  37. MECS Summary • MECS: a novel one-to-many topology • Excellent connectivity • Effective wire utilization • Good fit for planar substrates • Results summary • MECS: lowest latency, high energy efficiency • Mesh-based topologies: best throughput • Flattened butterfly: smallest router area

  38. Outline • Introduction • Service Guarantees in Networks-on-Chip • Efficient Topologies for On-chip Interconnects • Kilo-NOC: A Network for 1000+ Nodes • Requirements and obstacles • Topology-centric Kilo-NOC architecture • Evaluation highlights • Summary and Future Work

  39. Scaling to a kilo-node NOC • Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees • MECS scalability obstacles • Buffer requirements: more ports, deeper buffers → area, energy, latency overheads • PVC scalability obstacles • Flow state, other storage → area, energy overheads • Preemption overheads → energy, latency overheads • Prioritization and arbitration → latency overheads

  40. Scaling to a kilo-node NOC • Goal: a NOC architecture that scales to 1000+ clients with good efficiency and strong guarantees • MECS scalability obstacles • Buffer requirements: more ports, deeper buffers → area, energy, latency overheads • PVC scalability obstacles • Flow state, other storage → area, energy overheads • Preemption overheads → energy, latency overheads • Prioritization and arbitration → latency overheads • Kilo-NOC: addresses topology and QOS scalability bottlenecks • This talk: reducing QOS overheads

  41. NOC QOS: Conventional Approach • Multiple virtual machines (VMs) sharing a die • Shared resources (e.g., memory controllers) • VM-private resources (cores, caches)

  42. NOC QOS: Conventional Approach NOC contention scenarios: • Shared resource accesses • memory access • Intra-VM traffic • shared cache access • Inter-VM traffic • VM page sharing

  45. NOC QOS: Conventional Approach NOC contention scenarios: • Shared resource accesses • memory access • Intra-VM traffic • shared cache access • Inter-VM traffic • VM page sharing Network-wide guarantees without network-wide QOS support

  46. Kilo-NOC QOS: Topology-centric Approach • Dedicated, QOS-enabled regions • Rest of die: QOS-free • A richly-connected topology (MECS) • Traffic isolation • Special routing rules • Ensure interference freedom
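
One way to read the routing rules above is as a classification of which packets must touch a QOS-enabled region at all: accesses to shared resources and inter-VM traffic do, while intra-VM traffic can stay on QOS-free paths inside its own partition. The node-to-VM map and shared-resource check below are hypothetical illustrations, not the talk's actual rules.

```c
/* Sketch of the traffic classification implied by the contention scenarios
 * listed earlier: only packets that can interfere across virtual machines
 * need to traverse a QOS-enabled router. */
#include <stdio.h>

#define NNODES 16

static const int vm_of_node[NNODES] = {     /* hypothetical VM placement */
    0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
};

static int is_shared_resource(int node) {   /* e.g., a memory controller (hypothetical) */
    return node == 15;
}

/* Does this packet have to traverse a QOS-enabled region? */
static int needs_qos(int src, int dst) {
    if (is_shared_resource(dst)) return 1;            /* shared-resource access */
    if (vm_of_node[src] != vm_of_node[dst]) return 1; /* inter-VM traffic */
    return 0;                                         /* intra-VM: QOS-free path */
}

int main(void) {
    printf("node 1 -> node 2  (same VM):           needs QOS? %d\n", needs_qos(1, 2));
    printf("node 1 -> node 15 (shared resource):   needs QOS? %d\n", needs_qos(1, 15));
    printf("node 1 -> node 8  (other VM):          needs QOS? %d\n", needs_qos(1, 8));
    return 0;
}
```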

