Router Architecture
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapters 16, 17

Overview
• Interconnect Network Introduction
• Topology (regular, irregular)
• Deadlock, Livelock
• Router Architecture (pipelined, classic)
• Routing (algorithms, mechanics)
• Network Interface (message passing, shared memory)
• Flow Control (circuit switching, packet switching (SAF, wormhole, virtual channel))
• Performance Analysis and QoS
• Implementation
• Evaluation
• Summary
• Concepts
SoC Architecture

Network-on-Chip
[Figure: network-on-chip with 16 switches (S) and 16 terminal nodes (T)]
• Information in the form of packets is routed via channels and switches from one terminal node to another
• The interface between the interconnection network and the terminals (clients) is called the network interface

Router Architecture: First thinking questions
• Functions
  • What functions must a router realize?
    • Wormhole router without virtual channels
    • Virtual channel routers
  • What are the minimum functions?
• Modules
  • What functional blocks should a router have to implement the required functions?
  • Which functions are on the data path, and which are on the control path?
  • What are the minimum functional units?

Router Architecture
• The discussion concentrates on a typical virtual-channel router
• Modern routers are pipelined and work at the flit level
• Head flits proceed through pipeline stages that perform routing and virtual channel allocation
• All flits pass through the switch allocation and switch traversal stages
• Most routers use credits to allocate buffer space

A typical virtual channel router
• A router's functional blocks can be divided into
  • Datapath: handles storage and movement of a packet's payload
    • Input buffers
    • Switch
    • Output buffers
  • Control plane: coordinates the movement of packets through the resources of the datapath
    • Route Computation
    • VC Allocator
    • Switch Allocator
This is a generic model. Can we skip the output buffers or the input buffers?

A typical virtual channel router
• The input unit
  • contains a set of flit buffers
  • maintains the state for each virtual channel
    • G = Global State
    • R = Route
    • O = Output VC
    • P = Pointers
    • C = Credits

Virtual channel state fields (Input)

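The five input state fields (G, R, O, P, C) can be sketched as a small Python record. This is a minimal illustration, not the layout of any real router: the class names `GlobalState` and `InputVCState` are hypothetical, the meaning given to the V state (waiting for an output VC) is an assumption, and a `deque` stands in for the head/tail pointers into the flit buffer.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class GlobalState(Enum):
    """Values of the G field; the letters follow the slides (I, R, V, A)."""
    IDLE = "I"
    ROUTING = "R"
    WAITING_VC = "V"   # assumption: head flit is waiting for an output VC
    ACTIVE = "A"


@dataclass
class InputVCState:
    """State kept per input virtual channel (hypothetical layout)."""
    g: GlobalState = GlobalState.IDLE          # G: global state
    r: Optional[int] = None                    # R: output port chosen by route computation
    o: Optional[int] = None                    # O: output VC granted by the VC allocator
    p: deque = field(default_factory=deque)    # P: stands in for the flit buffer pointers
    c: int = 0                                 # C: credits for downstream buffers


vc = InputVCState(c=4)
vc.p.append("head")                # head flit arrives and is buffered
vc.g = GlobalState.ROUTING         # G = R while the route is computed
vc.r = 2                           # route computation picks output port 2
print(vc.g.value, vc.r, len(vc.p), vc.c)
```

Keeping all five fields in one record mirrors the idea that the control plane only needs this small state to decide what a virtual channel may do in a given cycle.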
A typical virtual channel router
• During route computation the output port for the packet is determined
• Then the packet requests an output virtual channel from the virtual-channel allocator

A typical virtual channel router
• Flits are forwarded via the virtual channel by allocating a time slot on the switch and output channel using the switch allocator
• Flits are forwarded to the appropriate output during this time slot
• The output unit forwards the flits to the next router in the packet's path

Virtual channel state fields (Output)

Packet Rate and Flit Rate
• The control of the router operates at two distinct frequencies
  • Packet rate (performed once per packet)
    • Route computation
    • Virtual-channel allocation
  • Flit rate (performed once per flit)
    • Switch allocation
    • Pointer and credit count update

The Router Pipeline
• A typical router pipeline includes the following stages
  • RC (Routing Computation)
  • VA (Virtual Channel Allocation)
  • SA (Switch Allocation)
  • ST (Switch Traversal)
[Figure: pipeline timing diagram, no pipeline stalls]
Do all types of flits go through all four stages? Why? Can we design the pipeline with fewer than four stages?

The Router Pipeline
• Cycle 0
  • Head flit arrives and the packet is directed to a virtual channel of the input port (G = I)

The Router Pipeline
• Cycle 1: Routing Computation
  • Virtual channel state changes to routing (G = R)
  • Head flit enters the RC stage
  • First body flit arrives at the router

The Router Pipeline
• Cycle 2: Virtual Channel Allocation
  • Route field (R) of the virtual channel is updated
  • Head flit enters the VA stage
  • First body flit waits in the flit buffer (only head flits enter the RC stage)
  • Second body flit arrives at the router

The Router Pipeline
• Cycle 2: Virtual Channel Allocation
  • The result of the routing computation is input to the virtual channel allocator
  • If successful, the allocator assigns a single output virtual channel
  • The state of the virtual channel is set to active (G = A)

The Router Pipeline
• Cycle 3: Switch Allocation
  • All further processing is done on a per-flit basis
  • Head flit enters the SA stage
  • Any active VC (G = A) that contains buffered flits (indicated by P) and has downstream buffers available (C > 0) bids for a single-flit time slot through the switch from its input VC to the output VC

The Router Pipeline
• Cycle 3: Switch Allocation
  • If successful, the pointer field is updated
  • The credit field is decremented

The Router Pipeline
• Cycle 4: Switch Traversal
  • Head flit traverses the switch
• Cycle 5:
  • Head flit starts traversing the channel to the next router

The Router Pipeline
• Cycle 7:
  • Tail flit traverses the switch
  • Output VC set to idle
  • Input VC set to idle (G = I) if the buffer is empty
  • Input VC set to routing (G = R) if another head flit is in the buffer

The Router Pipeline
• Only the head flits enter the RC and VA stages
• The body and tail flits are stored in the flit buffers until they can enter the SA stage
What does the timing diagram look like if the pipeline is stalled? Under what circumstances will the pipeline stall?

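The stall-free timing walked through on the preceding slides can be reproduced with a short Python sketch. It assumes one flit arrives per cycle starting at cycle 0 and that nothing ever stalls; the function name `pipeline_schedule` and the dictionary layout are hypothetical, chosen only for illustration.

```python
def pipeline_schedule(num_flits):
    """Return, per flit, the cycle in which it occupies each stage (no stalls)."""
    schedule = []
    for i in range(num_flits):
        if i == 0:
            # The head flit is the only one that passes through RC and VA.
            stages = {"arrive": 0, "RC": 1, "VA": 2, "SA": 3, "ST": 4}
        else:
            # Body and tail flits wait in the buffer, then follow the flit
            # ahead of them through SA and ST one cycle later.
            stages = {"arrive": i, "SA": 3 + i, "ST": 4 + i}
        schedule.append(stages)
    return schedule


# A four-flit packet: head, two body flits, tail.
for index, stages in enumerate(pipeline_schedule(4)):
    print(index, stages)
```

For a four-flit packet the head flit traverses the switch in cycle 4 and the tail flit in cycle 7, matching the cycle numbers given on the slides.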
Pipeline Stalls
• Pipeline stalls can be divided into
  • Packet stalls
    • can occur if the virtual channel cannot advance to its R, V, or A state
  • Flit stalls
    • occur if a virtual channel is in the active state and the flit cannot successfully complete switch allocation due to
      • lack of flit
      • lack of credit
      • losing arbitration for the switch time slot

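The three causes of a flit stall can be written as a small decision function for one active virtual channel in one cycle. This is only a sketch: the function name `flit_stall_reason` and its parameters are hypothetical, and the arbitration result is passed in rather than computed.

```python
from typing import Optional


def flit_stall_reason(buffered_flits: int, credits: int,
                      won_arbitration: bool) -> Optional[str]:
    """Return why an active VC stalls in switch allocation, or None if it proceeds."""
    if buffered_flits == 0:
        return "lack of flit"        # nothing buffered, so nothing can bid
    if credits == 0:
        return "lack of credit"      # no downstream buffer is available
    if not won_arbitration:
        return "lost arbitration"    # another VC got the switch time slot
    return None                      # flit wins its slot and moves on to ST


print(flit_stall_reason(buffered_flits=2, credits=0, won_arbitration=True))
print(flit_stall_reason(buffered_flits=2, credits=3, won_arbitration=True))
```

The order of the checks reflects that a VC only bids for the switch when it has both a buffered flit and a credit, so arbitration can only be lost once the first two conditions hold.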
Example for Packet Stall
• Virtual-channel allocation stall
• The head flit of packet A can enter the VA stage only after the tail flit of packet B has completed switch allocation and released the virtual channel

Example for Flit Stalls
• Switch allocation stall
• The second body flit fails to allocate the requested connection in cycle 5

Example for Flit Stalls
• Buffer empty stall
• Body flit 2 is delayed by three cycles. However, since it does not have to enter the RC and VA stages, the output is delayed by only one cycle

Credits
• A buffer is allocated in the SA stage on the upstream (transmitting) node
• To reuse the buffer, a credit is returned over a reverse channel after the same flit departs the SA stage of the downstream (receiving) node
• When the credit reaches the input unit of the upstream node, the buffer is available and can be reused

Credits
• The credit loop can be viewed as a token that
  • starts at the SA stage of the upstream node
  • travels downstream with the flit
  • reaches the SA stage at the downstream node
  • returns upstream as a credit

Credit Loop Latency
• The credit loop latency $t_{crt}$, expressed in flit times, gives a lower bound on the number of flit buffers needed on the upstream side for the channel to operate at full bandwidth
• $t_{crt}$ in flit times is given by
  $t_{crt} = t_f + t_c + 2T_w + 1$
  where $t_f$ is the flit pipeline delay, $t_c$ is the credit pipeline delay, and $T_w$ is the one-way wire delay
Why plus 1 here?

Credit Round-trip Time and Credit Stall
• Virtual channel router with 4 flit buffers
• $t_f = 4$, $t_c = 2$, $T_w = 2$, so $t_{crt} = 4 + 2 + 2 \cdot 2 + 1 = 11$
[Figure: timing diagram of the credit loop showing the flit pipeline ($t_f$), the wire delays ($T_w$), credit transmit, the credit pipeline ($t_c$), and credit update. White: upstream pipeline stages. Grey: downstream pipeline stages]
What if the virtual channel has 5 flit buffers? When does the pipeline stall start, and for how many cycles does it last?

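The arithmetic on this slide can be checked with a few lines of Python. The formula and the example values come from the slides; the function name `credit_loop_latency` and its parameter names are hypothetical.

```python
def credit_loop_latency(t_f: int, t_c: int, t_w: int) -> int:
    """Credit loop latency in flit times: t_crt = t_f + t_c + 2*T_w + 1."""
    return t_f + t_c + 2 * t_w + 1


# Example values from the slide: t_f = 4, t_c = 2, T_w = 2.
t_crt = credit_loop_latency(t_f=4, t_c=2, t_w=2)
print(t_crt)  # 11
```

With only 4 flit buffers and a loop latency of 11 flit times, the upstream node runs out of credits before the first credit returns, which is the credit stall the slide illustrates.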
Credit Loop Latency
• If the number of buffers available per virtual channel is $F$, the duty factor of the channel will be
  $d = \min(1, F / t_{crt})$
• The duty factor will be 100% as long as there are sufficient flit buffers to cover the round-trip latency

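The duty factor formula can be compared against a toy cycle-by-cycle model. The sketch below makes simplifying assumptions: the sender always has a flit ready, sends at most one flit per cycle, and each credit spent in cycle t becomes usable again in cycle t + t_crt. The function names `duty_factor` and `simulate_duty_factor` are hypothetical.

```python
from collections import deque


def duty_factor(num_buffers: int, t_crt: int) -> float:
    """Closed-form duty factor d = min(1, F / t_crt)."""
    return min(1.0, num_buffers / t_crt)


def simulate_duty_factor(num_buffers: int, t_crt: int, cycles: int) -> float:
    """Fraction of cycles in which a flit is sent under the toy credit model."""
    credits = num_buffers
    returning = deque()            # cycles at which outstanding credits come back
    sent = 0
    for cycle in range(cycles):
        # Collect every credit whose round trip has completed.
        while returning and returning[0] <= cycle:
            returning.popleft()
            credits += 1
        if credits > 0:            # a flit may only be sent if a credit is held
            credits -= 1
            returning.append(cycle + t_crt)
            sent += 1
    return sent / cycles


print(duty_factor(4, 11), simulate_duty_factor(4, 11, 1100))
print(duty_factor(11, 11), simulate_duty_factor(11, 11, 1100))
```

In this model a channel with 4 buffers and a loop latency of 11 sends 4 flits and then idles for 7 cycles, so it settles at a duty factor of 4/11, while 11 or more buffers keep the channel busy every cycle.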
Flit and Credit Encoding
• Option 1: flits and credits are sent over separate lines with separate widths
• Option 2: flits and credits are transported over the same lines. This can be done by
  • including credits in flits
  • multiplexing flits and credits at the phit level

Network Interface
Z. Lu / A. Jantsch / I. Sander
Dally and Towles, Chapter 20

Network-on-Chip
[Figure: network-on-chip with 16 switches (S) and 16 terminal nodes (T)]
• Information in the form of packets is routed via channels and switches from one terminal node to another
• The interface between the interconnection network and the terminals (clients) is called the network interface

Network Interface
• Different terminals with different interfaces shall be connected to the network
• The network uses a specific protocol, and all traffic on the network has to comply with the format of this protocol
[Figure: network, switch, network interface, and terminal node (resource)]

Network Interface
• The network interface plays an important role in a network-on-chip
  • it shall translate between the terminal protocol and the protocol of the network
  • it shall enable the client to communicate at the speed of the network
  • it shall not further reduce the available bandwidth of the network
  • it shall not increase the latency imposed by the network
• A poorly designed network interface is a bottleneck and can increase the latency considerably

Network Interfaces
• For message passing: symmetric
  • Processor-Network Interface
• For shared memory: asymmetric, load & store
  • Processor-Network Interface
  • Memory-Network Interface
• Line-card interface connecting an external network channel with an interconnection network used as a switching fabric
What are the differences between message passing and shared memory communication?

Network Interfaces for message passing
• Two-register interface
• Descriptor-based interface
• Message reception

Two-Register Interface
• For sending, the processor writes to a specific Net-out register
• For receiving, the processor reads a specific Net-in register
• Pro:
  • Efficient for short messages
• Cons:
  • Inefficient for long messages
    • Processor acts as DMA controller
  • Not safe, because it does not protect the network from software running on the processor
    • A misbehaving processor can send the first part of a message and then delay sending the end of the message indefinitely
    • A processor can tie up the network by failing to read a message from the input register
[Figure: processor registers R0 to R31 plus the Net out and Net in registers, connected to the network]

Descriptor-Based Interface
• The processor composes a message in a set of dedicated message descriptor registers
• Each descriptor contains
  • an immediate value, or
  • a reference to a processor register, or
  • a reference to a block of memory
• A co-processor steps through the descriptors and composes the messages
• Safe, because the network is protected from the processor's software
[Figure: descriptor registers (Send, Start, Immediate, RN, Addr, Length, END) referencing processor registers R0 to R31 and memory]

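How a co-processor might step through the descriptors can be sketched in Python. The three descriptor kinds come from the slide; the class names `Immediate`, `RegisterRef`, and `MemoryRef`, the function `compose_message`, and the word-addressed list used as memory are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Immediate:
    value: int          # the word is stored in the descriptor itself


@dataclass
class RegisterRef:
    index: int          # the word is read from processor register R<index>


@dataclass
class MemoryRef:
    addr: int           # start of a block in (word-addressed) memory
    length: int         # number of words in the block


Descriptor = Union[Immediate, RegisterRef, MemoryRef]


def compose_message(descriptors: List[Descriptor],
                    registers: List[int],
                    memory: List[int]) -> List[int]:
    """Walk the descriptor list in order and gather the message words."""
    words: List[int] = []
    for d in descriptors:
        if isinstance(d, Immediate):
            words.append(d.value)
        elif isinstance(d, RegisterRef):
            words.append(registers[d.index])
        else:
            words.extend(memory[d.addr:d.addr + d.length])
    return words


registers = [0] * 32
registers[5] = 42
memory = list(range(100, 116))
message = compose_message(
    [Immediate(7), RegisterRef(5), MemoryRef(addr=2, length=3)],
    registers, memory)
print(message)  # [7, 42, 102, 103, 104]
```

Because the whole message is assembled from the descriptors before it is handed to the network, the processor's software never holds a half-sent message in the network, which is the safety property the slide points out.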
Receiving Messages
• A co-processor or a dedicated thread is triggered upon reception of an incoming message
• It unpacks the message and stores it in local memory
• It informs the receiving task via an interrupt or a status register update
How does a processor know that something has happened at an I/O device?

Shared Memory Interfaces
• The interconnection network is used to transmit memory read/write transactions between processors and memories
• We will further discuss
  • Processor-Network Interface
  • Memory-Network Interface
What does shared memory communication do?

Processor-Network Interface
• Load/store requests are stored in the request register
  • Type: read/write, cacheable or uncacheable, etc.
• Requests are tagged, usually encoding how the reply is to be handled, e.g., store in register R10
• In case of a cache miss, requests are stored in an MSHR (miss status holding register)

Processor-Network Interface
Consider a read operation:
• An uncacheable read request results in a pending read
• After the message has been formed and transmitted, its status changes to read requested
• When the network returns the message, its status changes to read complete
• Completed MSHR entries are forwarded to the reply register, and their status changes to idle

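The status sequence of a read can be sketched as a small state machine. The four status names follow the slide; the class `MSHR`, its method names, and the use of a plain tag string are hypothetical and only illustrate the ordering of the transitions.

```python
from enum import Enum


class MSHRStatus(Enum):
    IDLE = "idle"
    READ_PENDING = "read pending"
    READ_REQUESTED = "read requested"
    READ_COMPLETE = "read complete"


class MSHR:
    """One miss status holding register tracking a single read."""

    def __init__(self):
        self.status = MSHRStatus.IDLE
        self.tag = None       # how the reply is to be handled, e.g. "R10"
        self.data = None

    def _step(self, expected, new_status):
        # Refuse transitions that are out of order.
        if self.status is not expected:
            raise RuntimeError(f"expected {expected.value}, got {self.status.value}")
        self.status = new_status

    def issue_read(self, tag):
        self._step(MSHRStatus.IDLE, MSHRStatus.READ_PENDING)
        self.tag = tag

    def message_sent(self):
        self._step(MSHRStatus.READ_PENDING, MSHRStatus.READ_REQUESTED)

    def reply_received(self, data):
        self._step(MSHRStatus.READ_REQUESTED, MSHRStatus.READ_COMPLETE)
        self.data = data

    def forward_reply(self):
        # Hand (tag, data) to the reply register and free the entry.
        self._step(MSHRStatus.READ_COMPLETE, MSHRStatus.IDLE)
        reply = (self.tag, self.data)
        self.tag, self.data = None, None
        return reply


entry = MSHR()
entry.issue_read(tag="R10")
entry.message_sent()
entry.reply_received(data=0xCAFE)
print(entry.forward_reply(), entry.status.value)
```

The tag travels with the entry so that, when the reply comes back, the interface knows where the data belongs without involving the processor again.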
Processor-Network Interface
• Cache coherence protocols change the operation of the processor-network interface
  • Complete cache lines are loaded into the cache
  • The protocol requires a larger vocabulary of messages
    • Exclusive read request
    • Invalidation and updating of cache lines
• A cache coherence protocol requires the interface to send messages and update state in response to received messages
How will a cache change the operation of the processor-network interface?

Memory-Network Interface
• The interface receives memory request messages and sends replies
• Messages received from the network are stored in a TSHR (transaction status holding register)

Memory-Network Interface
• The request queue is used to hold request messages when all TSHRs are busy
• A TSHR tracks messages in the same way as an MSHR
• The Bank Control and Message Transmit Unit monitors changes in the TSHRs
Is a reply queue needed here? Why?

Memory-Network Interface
Consider a read operation:
• A read request initializes a TSHR entry with status read pending
• The subsequent memory access changes the status to bank activated
• Right before the first word is returned from the memory bank, the status is changed to read complete
• The message transmit unit formats the message and injects it into the network, and the TSHR entry is marked idle

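The read sequence at the memory side, together with the request queue that holds requests while all TSHRs are busy, can be sketched as follows. The status names follow the slides; the class `MemoryNetworkInterface`, its method names, the tuple used as a reply message, and the list used as the memory bank are assumptions for illustration.

```python
from collections import deque
from enum import Enum


class TSHRStatus(Enum):
    IDLE = "idle"
    READ_PENDING = "read pending"
    BANK_ACTIVATED = "bank activated"
    READ_COMPLETE = "read complete"


class MemoryNetworkInterface:
    def __init__(self, num_tshrs, memory):
        self.memory = memory
        self.tshrs = [{"status": TSHRStatus.IDLE, "addr": None}
                      for _ in range(num_tshrs)]
        self.request_queue = deque()   # holds requests while all TSHRs are busy

    def receive_read(self, addr):
        """Allocate an idle TSHR, or queue the request if none is free."""
        for index, entry in enumerate(self.tshrs):
            if entry["status"] is TSHRStatus.IDLE:
                entry.update(status=TSHRStatus.READ_PENDING, addr=addr)
                return index
        self.request_queue.append(addr)
        return None

    def activate_bank(self, index):
        assert self.tshrs[index]["status"] is TSHRStatus.READ_PENDING
        self.tshrs[index]["status"] = TSHRStatus.BANK_ACTIVATED

    def data_returning(self, index):
        assert self.tshrs[index]["status"] is TSHRStatus.BANK_ACTIVATED
        self.tshrs[index]["status"] = TSHRStatus.READ_COMPLETE

    def send_reply(self, index):
        """Format the reply, free the TSHR, and serve a queued request if any."""
        entry = self.tshrs[index]
        assert entry["status"] is TSHRStatus.READ_COMPLETE
        reply = ("read reply", entry["addr"], self.memory[entry["addr"]])
        entry.update(status=TSHRStatus.IDLE, addr=None)
        if self.request_queue:
            self.receive_read(self.request_queue.popleft())
        return reply


mni = MemoryNetworkInterface(num_tshrs=1, memory=[10, 11, 12, 13, 14, 15])
first = mni.receive_read(3)     # takes the only TSHR
second = mni.receive_read(5)    # queued, since no TSHR is idle
mni.activate_bank(first)
mni.data_returning(first)
print(mni.send_reply(first), mni.tshrs[0])
```

With a single TSHR the second request waits in the queue and is picked up as soon as the first reply has been injected, which shows why the queue is needed when requests arrive faster than the memory bank can serve them.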
Memory-Network Interface
• Cache coherence protocols can be implemented with this structure; however, the TSHR must be extended, e.g., with directory information

Summary
• Network interfaces bridge the processor with the network, and the memory with the network
• Message-passing interfaces
  • Two-register interface
  • Descriptor-based interface
• Shared memory interfaces, complicated by cache coherence
  • Processor-Network Interface
  • Memory-Network Interface