ECE260B – CSE241A Winter 2005 Clocking. Website: http://vlsicad.ucsd.edu/courses/ece260b-w05. Slides courtesy of Prof. Andrew B. Kahng. Outline. Problem Statement Clock Distribution Structures Robustness / Signal Integrity Control Clock Design: Skew Scheduling Topology Construction
ECE260B – CSE241A Winter 2005 Clocking • Website: http://vlsicad.ucsd.edu/courses/ece260b-w05 • Slides courtesy of Prof. Andrew B. Kahng
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Construction • Embedding
Why Clocks? • Clocks provide the means to synchronize • By allowing events to happen at known timing boundaries, we can sequence these events • Greatly simplifies building of state machines • No need to worry about variable delay through combinational logic (CL) • All signals delayed until clock edge (clock imposes the worst case delay) [Figure: FSM and dataflow structures built from combinational logic and registers] Courtesy K. Yang, UCLA
Clock Distribution Network • General goal of clock distribution • Deliver clock to all memory elements with acceptable skew • Deliver clock edges with acceptable sharpness • Clocking network design is one of the greatest challenges in the design of a large chip • Consumes up to 1/3 of chip power • Requires accurate signal delay • Requires signal integrity • Subject to uncertainty / variation of different processes / operating conditions
Clock Design Components • Oscillator • Dividers • Buffers • Strong drivers • Reduce delay • Signal integrity / slew rate • Interconnects • Balanced trees, meshes, etc. • Shielding (e.g., for crosstalk reduction) • Non-tree links / feedback loops
Clock Distribution Objective • Minimum / bounded skew • performance / hold time requirements • Guaranteed slew rate / signal integrity • Small insertion delay • Robustness under process / operating condition variation • Minimum cell / routing area • Minimum power consumption
Clock Distribution Robustness Subject to • Radically different loading (flip-flop density) • Across the die • ECO (Engineering Change Order) • Interconnect coupling • Signal integrity • Delay variation • Process variation • From lot-to-lot • Across the die • Buffers • Metal width • Supply voltage variation across the die • Both static IR drop and dynamic voltage drop • Temperature
Issues in Clock Distribution Network Design • Skew • Process, voltage, and temperature • Data dependence • Noise coupling • Load balancing • Power, CV²f (consumes up to 1/3 of total chip power) • Clock gating • Flexibility/Tunability • Compactness – fit into existing layout/design • Facilitate ECO
Clock Skew Causes • Designed (unavoidable) variations – mismatch in buffer load sizes, interconnect lengths • Process variation – process spread across die yielding different Leff, Tox, etc. values • Temperature gradients – changes MOSFET performance across die • IR voltage drop in power supply – changes MOSFET performance across die • Note: Delay from clock generator to fan-out points (clock latency) is not important by itself • BUT: increased latency leads to larger skew for same amount of relative variation Sylvester / Shepard, 2001
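The last point on the slide above (larger latency gives larger skew for the same relative variation) can be checked with a few lines of arithmetic. This is a minimal sketch: the function name, the latencies, and the ±5% variation are hypothetical values chosen for illustration, and it assumes two nominally identical clock paths that land at opposite extremes of the variation.

```python
def skew_from_latency(latency_ps, relative_variation):
    """Worst-case skew between two nominally equal clock paths.

    Assumption: one path is slow by +relative_variation and the
    other is fast by -relative_variation (hypothetical model).
    """
    slow = latency_ps * (1.0 + relative_variation)
    fast = latency_ps * (1.0 - relative_variation)
    return slow - fast


# Same 5% relative variation, two different insertion delays.
for latency_ps in (500.0, 1000.0):
    print(latency_ps, skew_from_latency(latency_ps, 0.05))
```

Under this model the skew is 2 × variation × latency, so doubling the latency doubles the skew even though the relative variation is unchanged.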
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Construction • Embedding
Clock Distribution Structures • RC-Tree • Less capacitance • More accuracy • Flexible wiring • Grids • Reliable • Less data dependency • Tunable (late in design) • Shown here for final stage drivers driving F/F loads
Grids • Gridded clock distribution common on earlier DEC Alpha microprocessors • Advantages: • Skew determined by grid density, not too sensitive to load position • Clock signals available everywhere • Tolerant to process variations • Usually yields extremely low skew values • Disadvantages: • Huge amount of wiring and power • To minimize such penalties, need to make grid pitch coarser → lose the grid advantage [Figure: pre-drivers feeding a global grid] Sylvester / Shepard, 2001
H-Tree • H-tree (Bakoglu) • One large central driver, recursive structure to match wirelengths • Halve wire width at branching points to reduce reflections • Disadvantages • Slew degradation along long RC paths • Unrealistically large central driver • Clock drivers can create large temperature gradients (ex. Alpha 21064 ~30° C) • Non-uniform load distribution • Inherently non-scalable (wire R growth) • Partial solution: intermediate buffers at branching points courtesy of P. Zarkesh-Ha Sylvester / Shepard, 2001
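The "recursive structure to match wirelengths" can be sketched in a few lines. This is an illustrative construction, not taken from the slides: the function name `h_tree_sinks`, the starting half-length, and the number of levels are assumptions. It alternates horizontal and vertical branches and checks that every sink sees the same wirelength from the root.

```python
def h_tree_sinks(x, y, half_len, levels, horizontal=True, path_len=0.0):
    """Return a list of (sink_x, sink_y, wirelength_from_root).

    Each level splits into two branches of length half_len, alternating
    between horizontal and vertical. The branch length is halved after
    every vertical level, so each "H" is half the size of its parent.
    """
    if levels == 0:
        return [(x, y, path_len)]
    dx, dy = (half_len, 0.0) if horizontal else (0.0, half_len)
    next_len = half_len if horizontal else half_len / 2.0
    sinks = []
    for sign in (-1.0, 1.0):
        sinks += h_tree_sinks(x + sign * dx, y + sign * dy, next_len,
                              levels - 1, not horizontal,
                              path_len + half_len)
    return sinks


sinks = h_tree_sinks(0.0, 0.0, 4.0, 4)
wirelengths = {s[2] for s in sinks}
print(len(sinks), wirelengths)
```

With four levels the sketch yields 16 sinks on a uniform grid, all at the same root-to-sink wirelength. Equal wirelength only gives equal delay when loads and wire parasitics are also matched, which is the weakness the slide lists under "Non-uniform load distribution".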
Buffered H-tree • Advantages • Ideally zero-skew • Can be low power (depending on skew requirements) • Low area (silicon and wiring) • CAD tool friendly (regular) • Disadvantages • Sensitive to process variations • Devices → want same-size buffers at each level of tree • Wires → want similar segment lengths on each layer in each source-sink path • Local clocking loads inherently non-uniform Sylvester / Shepard, 2001
Tree Balancing Con: Routing area often more valuable than Silicon Some techniques: a) Introduce dummy loads b) Snaking of wirelength to match delays Sylvester / Shepard, 2001
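Technique (a), dummy loads, can be illustrated with the Elmore delay of a single wire branch. This is a rough sketch under stated assumptions: each branch is a uniform RC wire ending in a lumped sink capacitance, the delay of the shared upstream path is common to both sinks and therefore ignored, and all names and parasitic values are hypothetical.

```python
def branch_delay(r_per_um, c_per_um, length_um, c_load_ff):
    """Elmore delay of one uniform RC wire driving a lumped load.

    Units: ohm/um, fF/um, um, fF -> delay in ohm*fF (femtoseconds).
    """
    r_wire = r_per_um * length_um
    c_wire = c_per_um * length_um
    return r_wire * (c_wire / 2.0 + c_load_ff)


def dummy_load_to_match(r_per_um, c_per_um, length_um, c_load_ff, target_delay):
    """Extra capacitance (fF) at the sink that raises the branch delay to target_delay."""
    r_wire = r_per_um * length_um
    extra = target_delay / r_wire - c_per_um * length_um / 2.0 - c_load_ff
    if extra < 0:
        raise ValueError("branch is already slower than the target delay")
    return extra


R, C = 0.1, 0.2                                   # hypothetical wire parasitics
slow = branch_delay(R, C, 1000.0, 20.0)           # long branch
fast = branch_delay(R, C, 600.0, 20.0)            # short branch
dummy = dummy_load_to_match(R, C, 600.0, 20.0, slow)
balanced = branch_delay(R, C, 600.0, 20.0 + dummy)
print(slow, fast, dummy, balanced)
```

The dummy load equalizes delay without extra routing, but it adds switched capacitance and therefore clock power; snaking trades that for routing area, which the slide notes is often the more valuable resource.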
Examples From Processor Chips • H-Tree, Asymmetric RC-Tree: IBM • Grids: DEC [Alphas] • Serpentines: Intel x86 [Young ISSCC97]
Example Skews From Processor Chips DEC-Alpha 21064 clock spines DEC-Alpha 21064 RC delays DEC-Alpha 21164 RC local delays DEC-Alpha 21164 RC delays for Global Distribution (Spine + Grid)
ReShape Clocks Example (High-End ASIC) • Balanced, shielded H-tree for pre-clock distribution • Mesh for block level distribution • All routes 5-6u M6/5, shielded with 1u grounds • ~10 buffers per node • E.g., ganged BUFx20’s • Output mesh must hit every sub-block [Figure: output mesh]
Block Level Mesh (.18u) • Clumps of 1-6 clock buffers, surrounded by capacitor pads • Shielded input and output m6 shorting straps • Pre-clock connects to input shorting straps • 1u m5 ribs every 20 - 30 u (4 to 6 rows) • Max 600u stride
Problems with Meshes • Burn more power at low frequencies • Blocks more routing resources (solution: integrated power distribution with ribs can provide shielding for ‘free’) • Difficult for ‘spare’ clock domains that will not tolerate regioning • Post placement (and routing) tuning required • No ‘beneficial skew’ possible • Clock gating only easy at root • Fighting tools to do analysis: • Clumped buffers a problem in Static Timing Analysis tools • Large shorted meshes a problem for STA tools • What does Elmore delay calculation look like for a non-tree? • Need full extraction and SPICE-like simulation to determine skew
Benefits of Meshes • Deterministic since shielded all the way down to rib distribution • No ECO placement required: all buffers preplaced before block placement • Low latency since uses shorted (= ganged, parallel) drivers, therefore lower skew • ECO placements of FFs later do not require rebalancing of tree • “Idealized” clocking environment for “concurrent dance” of RTL design and timing convergence
Hybrid Structure • Balanced tree on the top • Mesh in the middle • Minimize skew • Steiner minimum tree at the bottom • Minimize cost • Facilitate ECO
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Construction • Embedding
Process Variation • Intra-die and inter-die variations • Intra-die variation is increasingly significant since 0.13um technology • Systematic and random variations • Systematic variation is due to equipment, process, etc. • Global lens aberration in lithography causes systematic variation • Pattern-dependent optical proximity, chemical mechanical polish (CMP) • Random variation is due to inherent variation • Spatial correlation across a chip • Fast vs. slow corners
Process Variation • Metal wires • Width variation can be estimated by LUT(width, spacing) • Thickness variation due to CMP, which depends on local metal density • Thickness variation also depends on wire width and spacing • Could be up to 30-40% in 90nm process • Transistors • Channel length variation (delay ~ L^1.5) • Thin gate oxide: tox variation → Vth variation • Up to 30% variation in terms of driving capability
Process Variations – SPICE model • Process variations are reflected into a statistical SPICE model • Usually only a few parameters have a statistical distribution (e.g., {DL, DW, TOX, VTn, VTp}) and the others are set to a nominal value • The nominal SPICE model is obtained by setting the statistical parameters to their nominal value Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Global Variations (Inter-die) • All devices have the same set of model parameter values • Process variations → performance variations [Figure: critical path delay of a 16-bit adder] Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Local Variations (Intra-die) • Each device instance has a slightly different set of model parameter values (aka device mismatch) • The performance of some analog circuits strongly depends on the degree of matching of device properties • Digital circuits are in general more immune to mismatch, but clock distribution network is sensitive (clock skew) Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Statistical Design • Need to account for process variations during design phase • Statistical design • Nominal design • Yield optimization • Design centering Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Statistical Design Slide courtesy of A. Nardi, J. Rabaey, K. Keutzer of UCB
Process Variation Tolerance Enhancement • Rule of thumb: balanced tree • Identical buffers at identical heights • Drive identical subtree loads • Can we do better than this? • Process variation tolerant clock design • Bounded-skew DME • Topology construction • With process variation tolerance in objective • Useful skew scheduling • To the center of permissible ranges
Signal Integrity • Crosstalk • Capacitive, inductive • Supply voltage drop • IR, L dI/dt, LC resonance • Temperature • Increased resistance with higher temperature • Substrate coupling • Parasitic resistance, capacitance in the substrate layer
Crosstalk • Due to the coupling capacitance between interconnections, a signal switching on a net (aggressor) may affect the voltage waveform on a neighboring net (victim) [Figure panels: noise propagation; increased delay]
Design for Crosstalk • It can be both capacitive and inductive • Capacitive is dominant at current switching speeds • To reduce it: • Use of shielding layer (inter-layer) • Use of shielding wire (intra-layer) [Figure: cross-section with GND and VDD shield wires above the substrate]
Clock Gating • Reduce power consumption by temporarily shutting down part of the circuit • Additional cost of enabling circuits [Figure: two FFs with combinational logic between them; enabling logic derives CLK1 and CLK2 from CLK]
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Construction • Embedding
Skew = Local Constraint • D: longest path, d: shortest path (between two sequentially adjacent FFs) • Permissible range: $-d + t_{hold} < \text{Skew} < T_{period} - D - t_{setup}$ • Skew below the lower bound: race condition • Skew inside the range: safe • Skew above the upper bound: cycle time violation • Timing is correct as long as the clock signals of sequentially adjacent FFs arrive within a permissible skew range W. Dai, UC Santa Cruz
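The permissible range can be evaluated directly from the inequality on the slide above. A minimal sketch follows; the function names and the numeric delays are hypothetical, and skew is used with whatever sign convention the slide's inequality implies.

```python
def permissible_skew_range(d_short, d_long, t_hold, t_setup, t_period):
    """Return (lower, upper) bounds: -d + t_hold < skew < T - D - t_setup."""
    lower = -d_short + t_hold
    upper = t_period - d_long - t_setup
    return lower, upper


def classify_skew(skew, d_short, d_long, t_hold, t_setup, t_period):
    """Label a skew value using the three regions named on the slide."""
    lower, upper = permissible_skew_range(d_short, d_long, t_hold,
                                          t_setup, t_period)
    if skew <= lower:
        return "race condition"
    if skew >= upper:
        return "cycle time violation"
    return "safe"


# Hypothetical numbers (ns): d = 2, D = 5, hold = setup = 0.5, T = 6.
print(permissible_skew_range(2.0, 5.0, 0.5, 0.5, 6.0))
for skew in (-2.0, 0.0, 1.0):
    print(skew, classify_skew(skew, 2.0, 5.0, 0.5, 0.5, 6.0))
```

The range is a per-pair constraint: each pair of sequentially adjacent FFs has its own d and D, and hence its own bounds.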
“Useful Skew” Design Robustness • Design will be more robust if clock signal arrival time is in the middle of permissible skew range, rather than on edge [Figure: three FFs in sequence with combinational path delays of 6 ns and 2 ns, T = 6 ns; clock schedule “0 0 0” is at verge of violation, schedule “2 0 2” has more safety margin] W. Dai, UC Santa Cruz
Constraints on Skews • FFi receives clock signal delayed by $x_i \ge \mathrm{MIN\_DEL}$ • $0 < \alpha \le 1 \le \beta$: if nominal clock delay is $x_i$, then actual clock delay $x$ must fall within interval $\alpha x_i \le x \le \beta x_i$ • For FF to operate correctly when clock edge arrives at time x, the correct input data must be present and stable during the time interval (x – SETUP, x + HOLD) • For $1 \le i, j \le L$ (L = #FFs), we compute lower and upper bounds MIN(i,j) and MAX(i,j) for the time that is required for a signal edge to propagate from FFi to FFj • Avoid double-clocking (race condition) • $x_i + \mathrm{MIN}(i,j) \ge x_j + \mathrm{HOLD}$ • Avoid zero-clocking • $x_i + \mathrm{SETUP} + \mathrm{MAX}(i,j) \le x_j + P$; P = clock period
Optimal Useful Skews by Linear Programming • LP_SPEED (clock period reduction): minimize P s.t. • $x_i - x_j \ge \mathrm{HOLD} - \mathrm{MIN}(i,j)$ • $x_j - x_i + P \ge \mathrm{SETUP} + \mathrm{MAX}(i,j)$ • $x_i \ge \mathrm{MIN\_DEL}$ • LP_SAFETY (robustness): maximize M s.t. • $x_i - x_j - M \ge \mathrm{HOLD} - \mathrm{MIN}(i,j)$ • $x_j - x_i - M \ge \mathrm{SETUP} + \mathrm{MAX}(i,j) - P$ • $x_i \ge \mathrm{MIN\_DEL}$ • Notes • J. P. Fishburn, “Clock Skew Optimization”, IEEE Trans. Computers 39(7) (1990), pp. 945-951. • T. G. Szymanski, “Computing Optimal Clock Schedules”, Proc. DAC, June 1992, pp. 399-404. • Useful Skew optimization is similar to Retiming optimization • Peak current reductions are a side benefit
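For a fixed M, the LP_SAFETY constraints are difference constraints on the clock delays, so one way to solve small instances without an LP solver is a binary search on M with a Bellman-Ford feasibility check. This is a swapped-in technique for illustration, not the method of the cited papers. Assumptions: the hold condition is x_i + MIN(i,j) ≥ x_j + HOLD and the setup condition is x_i + SETUP + MAX(i,j) ≤ x_j + P for a path from FF i to FF j; the MIN_DEL bound is handled by shifting the whole schedule; all names and numbers are hypothetical.

```python
def feasible_schedule(n, paths, period, setup, hold, margin):
    """Return clock delays x[0..n-1] meeting both constraints with `margin`, or None.

    paths: list of (i, j, d_min, d_max) for a data path from FF i to FF j.
    Each constraint has the form x[v] - x[u] <= w, i.e. an edge u -> v of weight w.
    """
    edges = []
    for i, j, d_min, d_max in paths:
        edges.append((i, j, d_min - hold - margin))            # hold
        edges.append((j, i, period - setup - d_max - margin))  # setup
    x = [0.0] * n  # equivalent to a virtual source with zero-weight edges
    for _ in range(n):
        changed = False
        for u, v, w in edges:
            if x[u] + w < x[v] - 1e-12:
                x[v] = x[u] + w
                changed = True
        if not changed:
            base = min(x)
            return [t - base for t in x]  # shift so the smallest delay is 0
    return None  # still relaxing after n passes: negative cycle, infeasible


def max_safety_margin(n, paths, period, setup, hold, tol=1e-6):
    """Binary search for the largest feasible M; returns (M, schedule) or None."""
    lo, hi = 0.0, period
    best = feasible_schedule(n, paths, period, setup, hold, lo)
    if best is None:
        return None
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        trial = feasible_schedule(n, paths, period, setup, hold, mid)
        if trial is None:
            hi = mid
        else:
            lo, best = mid, trial
    return lo, best


# Hypothetical chain FF0 -> FF1 (2 ns) -> FF2 (6 ns), P = 6 ns, SETUP = HOLD = 0.
paths = [(0, 1, 2.0, 2.0), (1, 2, 6.0, 6.0)]
margin, schedule = max_safety_margin(3, paths, 6.0, 0.0, 0.0)
print(round(margin, 3), [round(t, 3) for t in schedule])
```

The check relies on the standard fact that a system of difference constraints is feasible exactly when its constraint graph has no negative-weight cycle. The zero-skew schedule for this example has no setup margin on the 6 ns path, while the returned schedule gives every constraint the same slack.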
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Design • Embedding • For zero skew (ZST-DME) • For bounded skew (BST-DME)
Zero-Skew Tree (ZST) Problem • Zero Skew Clock Routing Problem (S,G): Given a set S of sink locations and a connection topology G, construct a ZST T(S) with topology G and having minimum cost. • Skew = maximum value of |td(s0,si) – td(s0,sj)| over all sink pairs si, sj in S. • td(s0,si) = signal delay from source s0 to sink si • Connection topology G = rooted binary tree with nodes of S as leaves • Edge ea in G is the edge from a to its parent • |ea| is the (assigned) length of edge ea • Cost = total edge length
A Zero-Skew Routing Algorithm • Finds a ZST under linear delay model with minimum cost over all ZSTs with topology G and sink set S • Terms • Manhattan Arc: line segment with slope +1 or –1 • Tilted Rectangular Region (TRR): collection of points within a fixed distance of a Manhattan arc • Core = Manhattan arc • Radius = distance • Merging segment = locus of feasible locations for a node v in the topology, consistent with minimum wirelength • If v is a sink, then ms(v) = {v} • If v is an internal node, then ms(v) is the set of all points within distance |ea| of ms(a), and within distance |eb| of ms(b)
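Under the linear delay model, delay equals path length, so the edge lengths |ea| and |eb| for merging two subtrees follow from balancing the two sides. The sketch below is illustrative: the function name and arguments are assumptions, `delay_a` and `delay_b` stand for the (already balanced) delays from ms(a) and ms(b) down to their sinks, and `dist` is the Manhattan distance between ms(a) and ms(b).

```python
def merge_edge_lengths(dist, delay_a, delay_b):
    """Return (|ea|, |eb|) such that |ea| + delay_a == |eb| + delay_b.

    If the delay difference fits within dist, the two edges share exactly
    dist of wire. Otherwise the slower side gets a zero-length edge and the
    faster side is extended (a wire detour) to absorb the difference.
    """
    diff = delay_b - delay_a
    if abs(diff) <= dist:
        e_a = (dist + diff) / 2.0
        return e_a, dist - e_a
    if diff > 0:
        return diff, 0.0   # subtree b is slower: detour on edge ea
    return 0.0, -diff      # subtree a is slower: detour on edge eb


print(merge_edge_lengths(10.0, 0.0, 4.0))   # balanced split of the 10 units
print(merge_edge_lengths(2.0, 0.0, 6.0))    # detour case
```

In the first call the merging cost |ea| + |eb| equals the distance between the two merging segments; in the detour case it exceeds that distance, which is the price of keeping zero skew.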
Phase 1: Tree of Merging Segments • Goal: Construct a tree of merging segments corresponding to topology G • Merging segment of a node depends on merging segments of its children ⇒ bottom-up construction • Let a, b be children of v. We want placements of v that allow TSa and TSb to be merged with minimum added wire while preserving zero skew • Merging cost = |ea| + |eb| • Fact: The intersection of two TRRs is also a TRR and can be found in constant time • Constant time per each new merging segment ⇒ linear time (in size of S) to construct entire tree
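The constant-time TRR intersection is easiest to see in rotated coordinates u = x + y, v = x − y, where Manhattan distance becomes the larger of |Δu| and |Δv| and a TRR becomes an axis-aligned rectangle. The sketch below assumes this representation; the function names and the example sinks are hypothetical, and the core must be a Manhattan arc or a single point for the rectangle to be exact.

```python
def trr(core_p, core_q, radius):
    """TRR as (u_lo, u_hi, v_lo, v_hi) in rotated coordinates.

    core_p, core_q: (x, y) endpoints of a Manhattan arc (equal for a point).
    """
    us = [core_p[0] + core_p[1], core_q[0] + core_q[1]]
    vs = [core_p[0] - core_p[1], core_q[0] - core_q[1]]
    return (min(us) - radius, max(us) + radius,
            min(vs) - radius, max(vs) + radius)


def intersect(t1, t2):
    """Intersect two TRRs in O(1); returns None when they do not overlap."""
    u_lo, u_hi = max(t1[0], t2[0]), min(t1[1], t2[1])
    v_lo, v_hi = max(t1[2], t2[2]), min(t1[3], t2[3])
    if u_lo > u_hi or v_lo > v_hi:
        return None
    return (u_lo, u_hi, v_lo, v_hi)


def to_xy(u, v):
    """Map rotated coordinates back to (x, y)."""
    return ((u + v) / 2.0, (u - v) / 2.0)


# Two sinks at Manhattan distance 10, merged with |ea| = |eb| = 5.
a, b = (0, 0), (6, 4)
ms = intersect(trr(a, a, 5), trr(b, b, 5))
print(ms, to_xy(ms[0], ms[2]), to_xy(ms[1], ms[3]))
```

Here the intersection degenerates to a segment with constant u, that is, a Manhattan arc of slope −1 whose endpoints are each at distance 5 from both sinks. This is the merging segment of the parent node.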
Phase 2: Find Node Placements • Goal: Find exact locations (“embeddings”) pl(v) of internal nodes v in the ZST topology • If v is the root node, then any point on ms(v) can be chosen as pl(v) • If v is an internal node other than the root, and p is the parent of v, then v can be embedded at any point in ms(v) that is at distance |ev| or less from pl(p) • Detail: create square TRR trrp with radius |ev| and core equal to pl(p); placement of v can be any point in ms(v) ∩ trrp • Each instruction executed at most once for each node in G, and TRR intersection is O(1) time ⇒ Find_Exact_Placements is O(n) ⇒ DME is O(n)
Outline • Problem Statement • Clock Distribution Structures • Robustness / Signal Integrity Control • Clock Design: • Skew Scheduling • Topology Design • Embedding • For zero skew (ZST-DME) • For bounded skew (BST-DME)