Asynchronous Logic: Results and Prospects

Alain J. Martin, California Institute of Technology. NTU, March 2007

What Is Asynchronous Logic?

Sequencing and Computation • “An algorithm is a sequence of computational steps.” (CL&R) • How do we implement sequencing in a continuous physical medium? • Traditional answer: use a global time reference (“the clock”) [Figure: clock CLK and stages A, B, C, D, E]

Can we compute without a clock? • Yes: “asynchronous” or “clockless” logic, also called “self-timed” or “speed-independent” • David Muller, “Theory of Asynchronous Circuits” (1959) • ILLIAC (1959) and ILLIAC II (1962): partially asynchronous • PDP6 (1960): asynchronous

Can we compute without a clock and without delay assumptions? • Delay-insensitivity (Molnar, 198x…) • Almost: “The class of delay-insensitive circuits is limited (not Turing-complete).” (Martin, 1990) • Quasi-delay-insensitive (QDI) logic: delay-insensitive plus isochronic forks (the only delay assumption) • QDI is Turing-complete (Martin & Manohar, 1996)

What is an Asynchronous Circuit? • Asynchronous system: a collection of modules communicating by handshake protocols • A distributed system on a chip (communicating by message exchange) [Figure: modules A, B, C, D, E, each with an ack signal]

Caltech QDI Approach • Quasi delay-insensitive (QDI) design • Minimal delay assumptions (only isochronic forks) • Stricter logic synthesis (DI codes for datapath, completion trees), but… • Robust and efficient (no evidence that delay assumptions improve efficiency)

Why Asynchronous and QDI Logic?

Scientific Reasons • Understanding the role of time in computation • Limit of delay insensitivity • Implementing a digital computation directly in a continuous physical medium • Design by program transformation (real correctness-by-construction approach) • “VLSI design as programming” paradigm

Engineering Reasons • Better match for high-level synthesis • Can separate correctness from performance issues • Modularity and better use of concurrency • Large system design (SoC): only local communication • Efficiency • Average-case instead of worst-case behavior • Less pressure for global optimization (“timing closure”) • Robustness and reliability • Robust to variations in fabrication technology, temperature, voltage, and noise; SEU tolerance • Energy efficiency

Energy Advantages of Async • No clock • Up to 50% of clock power recuperated • Automatic shut-off of idle parts • Perfect clock gating • No glitches (spurious transitions) • Up to 50% of power in combinational circuits • Automatic adaptation to parameter variations • Voltage scaling: perfect exchange of delay against energy • Flexibility of asynchronous interfaces: better use of concurrency

Reactive Use in Embedded Systems • Archetype of a reactive system • Average execution time may be much shorter than maximal execution time • Sleep sequence without race condition • Modeled after wait/signal with condition variables • Instant wake-up from deep sleep
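The sleep sequence above is described as modeled after wait/signal with condition variables. As a software analogy only (not the circuit itself), the pattern can be sketched in Python. The class name `ReactiveCore` and its methods are hypothetical, chosen for illustration. The point of the sketch is that the test for pending work and the decision to sleep happen under one lock, so a wake-up cannot be lost between them.

```python
import threading

class ReactiveCore:
    """Hypothetical software analogy of a core that sleeps until an event arrives."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []   # events posted but not yet handled
        self.handled = []    # events handled so far, in arrival order

    def signal(self, event):
        # Post an event and wake the core if it is sleeping.
        with self._cond:
            self._pending.append(event)
            self._cond.notify()

    def run(self, count):
        # Handle exactly `count` events, sleeping whenever none is pending.
        for _ in range(count):
            with self._cond:
                # Test and sleep under the same lock: no lost wake-up.
                while not self._pending:
                    self._cond.wait()
                event = self._pending.pop(0)
            self.handled.append(event)

def demo():
    core = ReactiveCore()
    worker = threading.Thread(target=core.run, args=(3,))
    worker.start()
    for event in ("timer", "uart", "button"):
        core.signal(event)
    worker.join()
    return core.handled
```

Calling `demo()` returns the three events in the order they were signalled, whether the worker was already asleep or not when each one arrived.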

Robustness to PVT Variations • Increase in physical parameter variations (PVT) is becoming a huge problem… • Even worse in future technologies (nano CMOS or others) • Variations of physical parameters all affect timing • Increased timing variations reduce robustness and/or performance • Single time reference (clock) may become unavailable or too expensive in future technologies and large systems (SoC)

Robustness to Voltage and Temperature Variations

Single-event Upset and Soft-error Tolerance of QDI circuits • Soft-errors caused by alpha particles, cosmic rays and other radiation sources are becoming increasingly problematic, even at ground-level • QDI circuits can absorb most “dose-effects” • Single-event upsets that cause a soft-error (bit flip) can be corrected efficiently in QDI circuits • Error-correction scheme specific to QDI • Entire async microcontroller SEU-tolerant

Detection and Correction of SE in QDI circuits • Single-error detection: duplicate and compare • Correction: • prevent propagation of detected SE • stability of guards corrects automatically • “Detection is correction” • Simplest, most expensive coding, but simplest detection mechanism • Entire microcontroller SEU tolerant
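The "duplicate and compare" idea above can be illustrated with a behavioral toy model, assuming the comparison is done by a Muller C-element whose output changes only when both duplicated copies agree. This is a sketch under that assumption, not the circuit used in the microcontroller; the helper `run_with_upset` and the "restore" step are hypothetical stand-ins for the automatic correction by guard stability.

```python
class CElement:
    """Muller C-element: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.out = initial

    def step(self, a, b):
        if a == b:
            self.out = a
        return self.out   # on disagreement the previous output is held

def run_with_upset(values, upset_at):
    """Feed two copies of each value; flip copy b once at index `upset_at`."""
    c = CElement()
    trace = []
    for i, v in enumerate(values):
        a, b = v, v
        if i == upset_at:
            b ^= 1                       # single-event upset on one copy
            trace.append(c.step(a, b))   # copies disagree: output holds
            b = v                        # assumption: the upset node is restored
        trace.append(c.step(a, b))
    return trace
```

With `values = [1, 0, 1]` and an upset at index 1, the trace is `[1, 1, 0, 1]`: the output stalls at its old value while the copies disagree and then proceeds, so the flipped bit never propagates. In this model, blocking the error is the same mechanism as detecting it.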

Disadvantages of Async • Size overhead (more transistors) • Poorly understood and rarely taught • No industrial CAD tools (yet) • No well-developed testing procedure (yet) • No easy transition path for large established companies…

Experimental Evidence

Asynchronous Chips @ Caltech • World's first asynchronous microprocessor (1988) • Lattice-structure filter (1994) • MiniMIPS (1998) • Lutonium 8051 microcontroller (2005)

First Asynchronous Microprocessor (Caltech, 1988) • 16-bit RISC, 2-micron CMOS • Performance: 5 MIPS, 5mA @ 2V; 18 MIPS, 45mA @ 5V; 26 MIPS, 100mA @ 10V • Formal synthesis: initial sequential description was a single page of CHP code • 5 months from start of project to tape-out (small group) • Fully functional on first silicon • Potato-chip experiment: runs on a potato as power supply! • 50kHz @ 0.75V, 300kHz @ 0.9V

Asynchronous MIPS R3000 Microprocessor • Standard 32-bit RISC ISA • Single instruction issue, one branch delay slot • Precise exceptions • 2 on-chip caches: 4kB Icache and 4kB Dcache • First prototype (1998): no TLB • 2M transistors • First asynchronous processor competitive with large synchronous designs

MiniMIPS Low-Voltage Operation • Functional from 0.5V Vdd up • Functional at 0.4V with some transistor resizing

Asynchronous MIPS: Practical Results • HP’s 0.6-micron CMOS • Expected: 275 MIPS @ 7W @ 3.3V @ 25°C • First prototype: 190 MIPS @ 4W @ 3.3V @ 25°C • Voltage range: 1V (9.66MHz @ 0.021W) to 8V • Functional on first silicon despite • Inconsistencies in HP’s process parameters (e.g. higher Vt’s) • Long polysilicon wire overlooked in the critical fetch loop • (Testament to the robustness of the asynchronous design style!) • Roughly 4x faster than commercial synchronous MIPS ported to the same technology • Note: no particular effort made towards designing for low power.

Lutonium-18: QDI 8051 Microcontroller • TSMC SCN018 through MOSIS • 0.18µm CMOS • 1.8V nominal • |Vt| = 0.4V to 0.5V • Expected area: 5mm^2 (including 8kB SRAM) • Performance from low-level simulation (conservative!)

Energy Efficiency Metric: Et^2 • E = C*V^2, t = k/V • E*t^2 independent of V • Estimate of energy efficiency • Comparison of designs • “Algorithmic of energy” • See Chapter 15 in “Power Aware Computing”, Graybill & Melhem, eds., Kluwer
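The voltage independence stated on this slide follows in one step from its two first-order relations, with C the switched capacitance and k a technology-dependent constant, both taken as given by the slide:

```latex
E\,t^{2} \;=\; \left(C\,V^{2}\right)\left(\frac{k}{V}\right)^{2} \;=\; C\,k^{2}
```

The supply voltage cancels, so within the range where both relations hold, E t^2 characterizes the design rather than its operating point.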

Voltage Scaling Advantage: Comparison to Intel Xscale

Energy Breakdown and Comparisons
Microprocessor (results) • MIPS energy: 33nJ (async, 0.6µm) vs. 70nJ (sync, 0.6µm) • MIPS cycle time: 6ns (async, 0.6µm) vs. 21ns (sync, 0.6µm)
Microcontroller (estimation) • 8051 energy per instruction: 10.00nJ (1X) sync 0.5µm; 1.67nJ (6X) async 0.5µm; 0.56nJ (18X) async 0.18µm @ 1.8V; 0.14nJ (72X) async 0.18µm @ 0.9V • 8051 cycle time: 20ns (1X) sync 0.5µm; 10ns (2X) async 0.5µm; 5ns (4X) async 0.18µm @ 1.8V; 10ns (2X) async 0.18µm @ 0.9V
More than 100X Et^2 improvement over any other 8051
[Figure: energy breakdown by unit: icache, fetch, decode, exec units (adder, shifter, fblock, mem, mult/div), regfile (bypass), write back]
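As a rough arithmetic check of the comparison above, the Et^2 metric can be computed directly from the 8051 energy and cycle-time estimates on this slide. The sketch below only multiplies the slide's own numbers; the dictionary keys and the choice of the synchronous 0.5µm design as baseline are illustrative assumptions, and the slide's "any other 8051" claim covers designs not listed here.

```python
# (energy per instruction in nJ, cycle time in ns), copied from the slide.
DESIGNS = {
    "sync 0.5um": (10.00, 20.0),
    "async 0.5um": (1.67, 10.0),
    "async 0.18um @1.8V": (0.56, 5.0),
    "async 0.18um @0.9V": (0.14, 10.0),
}

def et2(energy_nj, cycle_ns):
    """Energy-delay-squared metric in nJ * ns^2."""
    return energy_nj * cycle_ns ** 2

def improvement_over(baseline="sync 0.5um"):
    """Ratio of the baseline's Et^2 to each design's Et^2 (higher is better)."""
    base = et2(*DESIGNS[baseline])
    return {name: base / et2(*values) for name, values in DESIGNS.items()}
```

With these figures the two 0.18µm operating points give the same Et^2 (about 14 nJ·ns^2 at both 1.8V and 0.9V), which is what the voltage-independence argument predicts, and both come out well above 100 times better than the synchronous baseline.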

Design Methodology

Handshakes & Dual-Rail Encoding • BUFFER: *[ L?x; R!x ] • Four-phase handshake • Dual-rail encoding: 3 wires (2 data, 1 ack) for one bit of information • Other DI codes are used: 1-of-N [Figure: buffer with input channel L? (data rails L0, L1; acknowledge La) and output channel R! (data rails R0, R1; acknowledge Ra)]
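A minimal Python sketch of the dual-rail code and the four-phase handshake for one bit follows. It assumes an active-high acknowledge and the all-zero state as the neutral spacer; the function names are illustrative and not taken from any tool.

```python
NEUTRAL = (0, 0)   # both rails low: no data on the channel

def encode(bit):
    """Dual-rail code as (rail0, rail1): exactly one rail is high for a valid bit."""
    return (1, 0) if bit == 0 else (0, 1)

def decode(rails):
    """Return the bit, or None for the neutral spacer; (1, 1) is not a codeword."""
    if rails == NEUTRAL:
        return None
    if rails == (1, 1):
        raise ValueError("both rails high is not a valid dual-rail codeword")
    return 0 if rails == (1, 0) else 1

def four_phase(bit):
    """States (rail0, rail1, ack) visited while sending one bit."""
    r0, r1 = encode(bit)
    return [
        (r0, r1, 0),   # 1. sender raises one data rail
        (r0, r1, 1),   # 2. receiver raises ack
        (0, 0, 1),     # 3. sender returns the data rails to neutral
        (0, 0, 0),     # 4. receiver lowers ack; channel is ready again
    ]
```

Because the data itself signals its own validity, the receiver needs no timing reference: a rail going high means "data present", and the return to neutral separates consecutive values.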

A QDI pipeline stage *[ L?x; R!f(x)]
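The behavior of *[ L?x; R!f(x) ] can be imitated in software with one thread per stage and one queue per channel. This is only an approximation: a CHP channel is an unbuffered rendezvous, while the single-slot `queue.Queue` used here holds one item. The names `stage` and `run_pipeline` are hypothetical.

```python
import queue
import threading

def stage(f, left, right, count):
    """Software stand-in for *[ L?x; R!f(x) ], run for `count` iterations."""
    for _ in range(count):
        x = left.get()       # L?x : blocks until a value arrives
        right.put(f(x))      # R!f(x) : blocks while the output slot is full

def run_pipeline(functions, inputs):
    """Chain one stage per function and push `inputs` through the chain."""
    # One single-slot queue per channel (approximation of a rendezvous).
    channels = [queue.Queue(maxsize=1) for _ in range(len(functions) + 1)]
    workers = [
        threading.Thread(target=stage,
                         args=(f, channels[i], channels[i + 1], len(inputs)))
        for i, f in enumerate(functions)
    ]

    def source():
        for x in inputs:
            channels[0].put(x)

    workers.append(threading.Thread(target=source))
    for w in workers:
        w.start()
    outputs = [channels[-1].get() for _ in inputs]   # the environment's R? side
    for w in workers:
        w.join()
    return outputs
```

For example, `run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3])` yields `[4, 6, 8]`. No stage knows anything about the speed of its neighbors; ordering comes only from the blocking communications.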

QDI PIPELINE vs Bundled Data • Dual-rail or 1-of-n data encoding • Completion tree • Critics: high overhead (2*N +1 wires and completion tree) • Alternative: Bundled data • N + 1 wires, no completion tree • Delay line for indicating completion, spurious transitions • Big controversy!
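The wire counts and the role of the completion tree can be made concrete with a short sketch. The wire formulas are the ones quoted on the slide. The `completion` function collapses a tree of C-elements into one behavioral function, which is an illustrative simplification, and all names are hypothetical.

```python
def wires_dual_rail(n_bits):
    """Two rails per bit plus one acknowledge (2*N + 1, as on the slide)."""
    return 2 * n_bits + 1

def wires_bundled(n_bits):
    """One wire per bit plus one control wire (N + 1, as counted on the slide)."""
    return n_bits + 1

def bit_valid(rails):
    """A dual-rail bit is valid when either of its rails is high."""
    rail0, rail1 = rails
    return rail0 | rail1

def completion(word, previous_done):
    """Behavior of a C-element tree over the per-bit validity signals.

    Rises only when every bit is valid, falls only when every bit is
    neutral, and otherwise holds its previous value.
    """
    valid = [bit_valid(bit) for bit in word]
    if all(valid):
        return 1
    if not any(valid):
        return 0
    return previous_done
```

For a 32-bit datapath this gives 65 wires against 33. The completion signal is what replaces the delay line of the bundled-data alternative: it is derived from the data itself, so it stays correct however slowly individual bits arrive.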

Fine-grain Pipeline (PCHB) [Figure: PCHB stage with input L?, function block f, output R!, validity signals Lv and Rv, completion logic, enable en, and acknowledges La and Ra]

FINE-GRAIN PIPELINE • No need for separate register • Very high throughput and low forward latency • Excellent Et^2 performance • Entirely QDI • Used in MiniMIPS and Lutonium • Area overhead significant

Lower-Level Synthesis: HSE (Handshaking Expansion) • CHP program: *[ L?x; R!x ] • Four-phase expansion of the communications: L? becomes [ Ld ]; La↑; [ ¬Ld ]; La↓ and R! becomes [ ¬Ra ]; Rd↑; [ Ra ]; Rd↓ • Handshaking expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ]

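Lower-Level Synthesis: PRS (Production Rule Set) • CHP program: *[ L?x; R!x ] • Handshaking expansion: *[ [ ¬Ra ∧ L0 → R0↑ [] ¬Ra ∧ L1 → R1↑ ]; La↑; [ Ra → R0↓, R1↓ ]; [ ¬L0 ∧ ¬L1 → La↓ ] ] • Production rules (pull-up): L0 ∨ L1 → Lv↑; ¬La ∧ ¬Ra ∧ L0 → R0↑; ¬La ∧ ¬Ra ∧ L1 → R1↑; R0 ∨ R1 → Rv↑; Lv ∧ Rv → La↑ • Production rules (pull-down): ¬L0 ∧ ¬L1 → Lv↓; Ra ∧ La → R0↓; Ra ∧ La → R1↓; ¬R0 ∧ ¬R1 → Rv↓; ¬Lv ∧ ¬Rv → La↓ • To PRS for CMOS …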

Lower-Level Synthesis: PRS (Production Rule Set) • Each production rule has the form guard expr → node↑ or guard expr → node↓ • These can be read as: if (guard expr is true) node = Vdd, or if (guard expr is true) node = GND • A set of production rules must be stable and non-interfering (for hazard-free circuits)
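The Vdd/GND reading of a production rule can be sketched as a tiny interpreter. The example rule set is the standard two-input C-element (a ∧ b → c↑, ¬a ∧ ¬b → c↓), chosen because it is small; the representation of rules as `(guard, node, value)` tuples and the function names are assumptions of this sketch, not the format of prsim or any other tool.

```python
def fire(rules, state, max_steps=100):
    """Fire enabled rules one at a time until no rule would change the state."""
    state = dict(state)
    for _ in range(max_steps):
        enabled = [(node, value) for guard, node, value in rules
                   if guard(state) and state[node] != value]
        if not enabled:
            return state              # stable: nothing left to fire
        node, value = enabled[0]
        state[node] = value           # value 1 means Vdd, value 0 means GND
    raise RuntimeError("no stable state reached (the rules may oscillate)")

def non_interfering(rules, state):
    """True if no node has a pull-up and a pull-down guard true together."""
    ups = {node for guard, node, value in rules if value == 1 and guard(state)}
    downs = {node for guard, node, value in rules if value == 0 and guard(state)}
    return not (ups & downs)

# C-element: a AND b -> c up ; (NOT a) AND (NOT b) -> c down
C_ELEMENT = [
    (lambda s: s["a"] and s["b"], "c", 1),
    (lambda s: not s["a"] and not s["b"], "c", 0),
]
```

When `a` and `b` disagree neither guard is true, so `c` keeps its value: the node is state-holding. `non_interfering` checks only a single state; a real verifier would have to check every reachable one, and stability (a guard staying true until its rule fires) is not checked here at all.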

Asynchronous Architectures • New asynchronous solutions for pipelined microprocessors • Execution units are in parallel, allowing concurrent and out-of-order execution of instructions

CAD Tools • Complete suite of tools: synthesis, simulation, verification, optimization, layout • Designer-assisted compilation • Tools are modular and customizable • Main representations: CHP, PRS, Cast

Design Flow [Figure: design flow from sequential program to physical layout. Design representations: sequential program, concurrent system, logical PRS, physical PRS, sized PRS, collection of cells, placed cells, routed cells, physical layout. Tool and step labels: chpsim, DDD, SDD, cosim, prsim/esim, spice, PL2, Sizer, Placer, Router, resize using wire information. Legend: synthesis, simulators, database.]

Robustness and Reliability

Robustness to Power-Supply Noise • HSPICE simulation of a typical QDI asynchronous circuit: a five-stage ring of async (PCHB) pipeline stages • Technology: TSMC 0.18-micron CMOS, Vdd: 1.8V, Vt: 0.5V, complete layout • Vdd oscillates between 3.5V and 0V (maximal amplitude) at various frequencies • The circuit keeps working correctly! (It will malfunction at some very high-frequency noise in phase with the circuit frequency.)

Robustness to Power-Supply Noise

SE-Tolerant QDI Circuits [Figure: SE-tolerant gate built with two C-elements (C); labels: intermediate, final; signals: xa, ya, xb, yb, z’a, z’b, za, zb]

Soft-error Tolerant Asynchronous Microprocessor (STAM) • The STAM architecture defines a simplified 32-bit RISC instruction set with eight general registers and four types of instructions: arithmetic, branch, memory, and shift operations. • A partially wired layout of the STAM was completed in TSMC SCN 0.18µm CMOS. In SPICE simulation, it runs at about 120 MHz. • The soft-error tolerance of the STAM was tested by injecting errors randomly while the STAM ran the RC4 program (a simple stream cipher) in the digital-level simulator. • About five soft errors, whose locations are chosen randomly from a list of all nodes of the STAM, are injected during the execution of each instruction. • About 25% of the 203,000 nets in the STAM experience a bit flip in each test. • The figure shows the locations of errors as dots; each box in the figure represents a CHP process.

Soft-error Tolerant Asynchronous Microprocessor (STAM)

Async Molecular Nanoelectronics • Molecular nano was our motivation for XQDI: extreme case of variability!

“Extreme” QDI (XQDI) • Can we improve QDI to eliminate (or reduce further) the remaining variability dependencies? • Isochronic forks • Keepers on state-holding nodes • Slew rates and oscillating rings

Isochronic Forks • Only timing assumption in QDI design • New design style that (1) minimizes the number of isochronic forks, and (2) mitigates their effect • d(single transition) << d(multi-transition path) • One-sided inequality can always be satisfied

Cell Design without Keeper • Keepers needed for state-holding cells • Keeper requires transistor sizing and balancing of current strengths. Difficult with variability… • Example of the C-element: with keeper vs. without keeper [Figure: the two C-element circuits]

Ring Oscillators • An async system is a collection of rings of operators. Oscillating rings are the engine of an asynchronous circuit. • Right choices of slew rates and number of stages guarantee that each ring oscillates. • What are the limits? How many restoring stages per ring?
