
Technologies for Reducing Power

Presentation Transcript


  1. Technologies for Reducing Power Trevor Mudge Bredt Family Professor of Engineering Computer Science and Engineering The University of Michigan, Ann Arbor SAMOS X, July 18th 2010

  2. Technologies for Reducing Power • Near threshold operation • 3D die stacking • Replacing DRAM with Flash memory

  3. Background • Moore’s Law • density of components doubles without increase in cost every 2 years • ignoring NRE costs, etc. • 65➞45➞32➞22➞16➞11➞8nm (= F nanometers) • Intel has 32nm in production • Energy per clock cycle (see the reconstructed formula below) • Vdd is the supply voltage, f is the frequency, C is the switched capacitance, and Ileak is the leakage current • Vth is the “threshold” voltage at which the gate switches • e.g. Vth ≈ 300 mV and Vdd ≈ 1V
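The slide's energy-per-cycle equation did not survive the transcript. A standard form consistent with the variables defined above (a reconstruction, not the slide's own rendering) is

$$E_{\mathrm{cycle}} \approx C\,V_{dd}^{2} + V_{dd}\,\frac{I_{leak}}{f},$$

where the first term is the switching energy and the second is the leakage energy accumulated over one clock period of length 1/f.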

  4. The Good Old Days—Dennard Scaling If s is the linear dimension scaling factor (s ≈ √2) • Device dimension (tox, L, W): 1/s • Voltage (V): 1/s • Current (I): 1/s • Capacitance (εA/tox): 1/s • Delay (VC/I): 1/s • Power (VI): 1/s² • Power density (VI/A): 1

  5. Recent Trends Circuit supply voltages are no longer scaling… Therefore, power doesn’t decrease at the same rate that transistor count is increasing – energy density is skyrocketing! • Form factors are shrinking while battery capacity is stagnant • Environmental concerns • A = gate area, scaling as 1/s² • C = capacitance, scaling as less than 1/s • Dynamic power dominates • The emerging dilemma: more and more gates can fit on a die, but cooling constraints are restricting their use

  6. Impact on Dennard scaling If s is the linear dimension scaling factor (s ≈ √2) • Device dimension (tox, L, W): 1/s • Voltage (V): 1/s ➞ 1 • Current (I): 1/s • Capacitance (εA/tox): 1/s • Delay (VC/I): 1/s ➞ 1 • Power (VI): 1/s² ➞ 1/s • Power density (VI/A): 1 ➞ s

  7. Techniques for Reducing Power • Near threshold operation—Vdd near Vth • 3D die stacking • Replacing DRAM with Flash memory

  8. Today: Super-Vth, High Performance, Power Constrained [Figure: normalized power, energy, and performance; energy/operation and log(delay) plotted against supply voltage from 0 through Vth to Vnom, with a Core i7-class part in the super-Vth region at 3+ GHz and ~40 mW/MHz] • Energy per operation is the key metric for efficiency • Goal: same performance, lower energy per operation

  9. Subthreshold Design [Figure: energy/operation and log(delay) vs. supply voltage; relative to super-Vth, sub-Vth operation cuts energy/operation by ~16X but increases delay by 500–1000X] Operating in the subthreshold region gives us huge power gains at the expense of performance → OK for sensors!

  10. Near-Threshold Computing (NTC) [Figure: energy/operation and log(delay) vs. supply voltage; relative to super-Vth, NTC gives ~6-8X lower energy/operation at ~10X the delay, while pushing on into sub-Vth gains only another ~2-3X in energy at a further ~50-100X in delay] • Near-Threshold Computing (NTC): • 60-80X power reduction • 6-8X energy reduction • Invest a portion of the extra transistors from scaling to overcome the barriers

  11. Restoring performance • Delay increases by 10X • Computation requires N operations • Break it into N/10 parallel subtasks—execution time restored • Total energy is still 8X less—operation count unchanged • Power 80X less (see the arithmetic sketch below) • Predicated on being able to parallelize workloads • Suitable for a subset of applications—as noted earlier • Streams of independent tasks—a server • Data parallel—signal/image processing • Important to have a solution for code that is difficult to parallelize—single-thread performance
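To make the slide's arithmetic concrete, here is a minimal sketch assuming the illustrative 10X delay and 8X energy-per-operation factors quoted above; the function and variable names are ours, not part of the original design.

```python
# A back-of-the-envelope model of the parallelization argument above. The
# 10x delay and 8x energy-per-operation factors are the illustrative numbers
# from the preceding NTC slide; the function and variable names are ours.

def ntc_parallel_tradeoff(n_ops=1_000_000, t_op=1e-9, e_op=1e-9,
                          delay_factor=10.0, energy_factor=8.0):
    """Compare one super-Vth core against delay_factor NTC cores."""
    # Super-threshold baseline: one core runs all N operations serially.
    t_base = n_ops * t_op                     # execution time
    e_base = n_ops * e_op                     # total energy
    p_base = e_base / t_base                  # average power

    # NTC: each core is delay_factor slower, each operation costs
    # 1/energy_factor as much energy. Splitting the work into delay_factor
    # subtasks restores the runtime, assuming it parallelizes cleanly.
    n_cores = int(delay_factor)
    t_ntc = (n_ops / n_cores) * (t_op * delay_factor)   # == t_base
    e_ntc = n_ops * (e_op / energy_factor)
    p_core = (e_op / energy_factor) / (t_op * delay_factor)  # one NTC core
    p_total = e_ntc / t_ntc                                  # whole cluster

    print(f"runtime ratio       : {t_ntc / t_base:.1f}x")
    print(f"energy reduction    : {e_base / e_ntc:.0f}x")
    print(f"per-core power drop : {(e_op / t_op) / p_core:.0f}x")
    print(f"cluster power drop  : {p_base / p_total:.0f}x")

ntc_parallel_tradeoff()
```

Running it prints a 1.0x runtime ratio, an 8x energy reduction, an 80x per-core power reduction, and an 8x whole-cluster power reduction, which shows how the 80X and 8X figures above fit together.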

  12. Interesting consequences: SRAM • SRAM has a lower activity rate than logic • Its VDD for minimum-energy operation (VMIN) is therefore higher • Logic naturally operates at a lower VMIN than SRAM—and slower (see the sketch below) [Figure: energy per operation vs. VDD, broken into total, dynamic, and leakage components]
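A toy model of the VMIN effect described above, not the authors' analysis; every constant is an illustrative placeholder chosen only to make the trend visible.

```python
# A toy model (not the authors' analysis) of why the minimum-energy voltage
# Vmin rises as a block's activity rate falls. Every constant here is an
# illustrative placeholder.

def vmin(activity, c_eff=1.0, i_leak=0.05, v_th=0.30, d0=1.0):
    """Vdd that minimizes energy per operation under a crude delay model."""
    best_v, best_e = None, float("inf")
    v = 0.40
    while v <= 1.10:
        delay = d0 * v / (v - v_th) ** 1.5   # slows sharply approaching Vth
        e_dyn = activity * c_eff * v * v     # switching energy per operation
        e_leak = i_leak * v * delay          # leakage charge while waiting
        if e_dyn + e_leak < best_e:
            best_v, best_e = v, e_dyn + e_leak
        v += 0.01
    return round(best_v, 2)

# Logic switches often; an SRAM array switches comparatively rarely.
print("logic-like block (high activity), Vmin:", vmin(activity=0.20))
print("SRAM-like block  (low activity),  Vmin:", vmin(activity=0.02))
```

With these placeholder constants the low-activity (SRAM-like) block's energy minimum lands at a noticeably higher VDD than the high-activity (logic-like) one, which is the slide's point.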

  13. NTC—Opportunities and Challenges • Opportunities: • New architectures • Optimize processes to gain back some of the 10X delay • 3D Integration—fewer thermal restrictions • Challenges: • Low Voltage Memory • New SRAM designs • Robustness analysis at near-threshold • Variation • Razor and other in-situ delay monitoring techniques • Adaptive body biasing • Performance Loss • Many-core designs to improve parallelism • Core boosting to improve single thread performance

  14. Proposed Parallel Architecture 1. R. Dreslinski, B. Zhai, T. Mudge, D. Blaauw, and D. Sylvester. An Energy Efficient Parallel Architecture Using Near Threshold Operation. 16th Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), Romania, Sep. 2007, pp. 175-188. 2. B. Zhai, R. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy Efficient Near-threshold Chip Multi-processing. Int. Symp. on Low Power Electronics and Design - 2007 (ISLPED), Aug. 2007, pp. 32-37.

  15. Cluster Results • Baseline • Single CPU @ 233MHz • NTC 4-Core • One core per L1 • 53% Avg. savings over Baseline • Clustered NTC • Multiple cores per L1 • 3 cores/cluster • 2 clusters • 74% Avg. savings over Baseline [Chart: energy savings at ~230MHz equivalent performance; Cholesky benchmark annotated at 48%]

  16. New NTC Architectures [Diagram: clusters of cores sharing an L1, connected through a bus / switched network to the next level of memory] • Recall, SRAM is run at a higher VDD than the cores with little energy penalty • Caches operate faster than the cores • Can introduce clustered architectures • Multiple cores share an L1 • The L1 is operated fast enough to satisfy all core requests in 1 cycle • Cores see the view of a private, single-cycle L1 (a timing sketch follows below) • Advantages (leading to lower power): • Clustered sharing • Less coherence/snoop traffic • Drawbacks (increased power): • Core conflicts evicting L1 data (more misses) • Additional bus/interconnect from cores to L1 (not as tightly coupled)
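A small timing sketch of the shared-L1 idea (our illustration, not the authors' design): the cache clock runs at n_cores times the core clock, and each core owns one cache sub-cycle per core cycle, so every core sees what looks like a private single-cycle L1.

```python
# A toy sketch of the shared-L1 cluster above (our illustration, not the
# authors' design): the L1 runs at n_cores times the core clock and serves
# one core per cache sub-cycle, so each core observes what looks like a
# private, single-cycle cache as long as it issues one request per cycle.

class SharedL1Cluster:
    def __init__(self, n_cores=4):
        self.n_cores = n_cores        # cache clock = n_cores x core clock
        self.log = []                 # (core, core_cycle, cache_subcycle)

    def core_cycle(self, cycle, requests):
        """requests[i] is the address core i presents this core cycle, or None."""
        for core, addr in enumerate(requests):
            if addr is None:
                continue
            # Core `core` owns cache sub-cycle `core` within this core cycle,
            # so its access completes before its next core clock edge.
            self.log.append((core, cycle, cycle * self.n_cores + core))

cluster = SharedL1Cluster(n_cores=4)
cluster.core_cycle(0, [0x100, 0x200, None, 0x340])   # three cores access, no stalls
for core, core_cyc, cache_cyc in cluster.log:
    print(f"core {core}: request in core cycle {core_cyc} "
          f"served in cache sub-cycle {cache_cyc}")
```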

  17. Digression—Chip Makers’ Response [Charts: performance vs. frequency and performance vs. number of cores] • Traded frequency for cores • multi-/many-cores • Risky behavior • “if we build it, they will come” • predicated on the solution to a tough problem—parallelizing software • Multi-cores have only been successes in • throughput environments—servers • heterogeneous environments—SoCs • data parallel applications • Parallel processing is application specific • that’s OK • treat parallel machines as attached processors • true in SoCs for some time—control plane / data plane separation

  18. Measured thread-level parallelism—TLP Caveat: desktop applications. G. Blake, R. Dreslinski, T. Mudge (University of Michigan), and K. Flautner (ARM). Evolution of Thread-Level Parallelism in Desktop Applications. ISCA 2010, to appear.

  19. Single thread performance: Boosting • Baseline • Cache runs at 4x the core frequency • Pipelined cache • Better single-thread performance • Boosting • Turn some cores off, speed up the rest • Cache frequency remains the same • Cache un-pipelined • Faster response time • Same throughput • Core sees a larger cache, hiding the longer DRAM latency • Increase core voltage and frequency further • Overclock • Cache frequency must be increased • Even faster response time • Increased throughput [Diagram of the cluster operating points: baseline 4 cores @ 15MHz (650mV) with cache @ 60MHz (700mV); boosted 1 core @ 60MHz (850mV) with cache @ 60MHz (1V); overclocked 1 core @ 120MHz (1.3V) with cache @ 120MHz (1.5V)] (these operating points are tabulated in the sketch below)
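The three operating points listed on this slide, collected into a small table so the single-thread versus throughput trade-off is easy to read off; the table layout and derived ratios are ours, while the frequencies and voltages are the slide's.

```python
# The three operating points listed on this slide, collected into a table
# (the layout and derived ratios are ours; frequencies and voltages are the
# slide's). Throughput is approximated crudely as active cores x clock rate.

MODES = [
    # name,       cores, core MHz, core V, cache MHz, cache V
    ("baseline",  4,     15,       0.650,  60,        0.700),
    ("boost",     1,     60,       0.850,  60,        1.000),
    ("overclock", 1,     120,      1.300,  120,       1.500),
]

base_core_mhz = MODES[0][2]                    # 15 MHz
base_throughput = MODES[0][1] * MODES[0][2]    # 4 cores x 15 MHz

for name, cores, core_mhz, core_v, cache_mhz, cache_v in MODES:
    single_thread = core_mhz / base_core_mhz            # response-time gain
    throughput = cores * core_mhz / base_throughput     # aggregate work proxy
    print(f"{name:9s}: {cores} core(s) @ {core_mhz:3d} MHz ({core_v:.2f} V), "
          f"cache @ {cache_mhz:3d} MHz ({cache_v:.2f} V) -> "
          f"single-thread {single_thread:.0f}x, throughput {throughput:.1f}x")
```

In this simple cores-times-MHz proxy, boosting quadruples single-thread speed at equal throughput, and overclocking gives 8x single-thread speed with double the throughput, matching the bullets above.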

  20. Single Thread Performance • Look at turning off cores and speeding up the remaining cores to gain faster response time. • Graph of cluster performance (not measured – intuition) R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near Threshold Computing: Overcoming Performance Degradation from Aggressive Voltage Scaling. Workshop on Energy-Efficient Design (WEED 2009), held at the 36th Int. Symp. on Computer Architecture, Austin, TX, June 2009.

  21. Boosting Clusters—scaled to 22nm • Baseline • Cache runs at 4x the core frequency • Pipelined cache • Better single-thread performance • Turn some cores off, speed up the rest • Cache frequency remains the same • Cache un-pipelined • Faster response time • Same throughput • Core sees a larger cache, hiding the longer DRAM latency • Boost core voltage and frequency further • Cache frequency must be increased • Even faster response time • Increased throughput [Diagram of the 22nm operating points: baseline 4 cores @ 140MHz with cache @ 60MHz; boosted 1 core @ 600MHz with cache @ 600MHz; overclocked 1 core @ 1.2GHz with cache @ 1.2GHz]

  22. Technologies for Reducing Power • Near threshold operation • 3D die stacking • Replacing DRAM with Flash memory

  23. A Closer Look at Wafer-Level Stacking Preparing a TSV—through silicon via, or “super-contact” [Cross-section legend: oxide; silicon; dielectric (SiO2/SiN); gate poly; STI (shallow trench isolation); W (tungsten contact & via); Al (M1 – M5); Cu (M6, top metal)] Bob Patti, CTO, Tezzaron Semiconductor

  24. Next, stack a second wafer and thin it (FF: face-to-face) Bob Patti, CTO, Tezzaron Semiconductor

  25. Then stack a third wafer (FB: face-to-back) [Diagram: 3rd wafer on 2nd wafer on 1st wafer, the controller] Bob Patti, CTO, Tezzaron Semiconductor

  26. Finally, flip, thin, and add pads [Diagram: the completed stack; 1st wafer (controller), 2nd wafer, 3rd wafer] Bob Patti, CTO, Tezzaron Semiconductor

  27. Characteristics • Very high bandwidth, low-latency, low-power buses are possible • 10,000 vias / sq mm • Electrical characteristics: ∼1fF and < 1Ω • No I/O pads for inter-stack connections—low power • Consider a memory stack (see the sketch below): • DDR3 ~40mW per pin • 1024 data pins → 40W • 4096 data pins → 160W • die-on-wafer ~24µW per pin • Pros / cons • 3D interconnect failure < 0.1ppm • Heat—1 W/sq mm • KGD (known good die) may be a problem—foundry • Different processes can be combined • DRAM / logic / analog / non-volatile memories • e.g. DRAM—split sense amps and drivers from memory cells
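A quick check of the I/O-power arithmetic above; the per-pin figures are the slide's, while the script and its names are ours.

```python
# Quick check of the I/O-power arithmetic on this slide: off-chip DDR3 pins
# versus die-on-wafer 3D connections (both per-pin figures are the slide's).

DDR3_W_PER_PIN = 40e-3   # ~40 mW per DDR3 data pin
TSV_W_PER_PIN = 24e-6    # ~24 uW per die-on-wafer connection

for pins in (1024, 4096):
    ddr3_w = pins * DDR3_W_PER_PIN
    tsv_w = pins * TSV_W_PER_PIN
    print(f"{pins:4d} data pins: DDR3 ~{ddr3_w:.0f} W, "
          f"3D stack ~{tsv_w * 1e3:.0f} mW (~{ddr3_w / tsv_w:.0f}x less)")
```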

  28. Centip3De—3D NTC Project [Stack diagram: logic layers A and B joined by face-to-face (F2F) bonds, plus DRAM layers with sense/logic and bond routing] • Centip3De Design • 130nm, 7-layer 3D-stacked chip • 128 ARM M3 cores • 1.92 GOPS @ 130mW • taped out: Q1 2010

  29. Stacking the Die [Diagram: core clusters of four ARM M3 cores each, with per-cluster I-cache, D-cache, and cache control on the cache layer] • Cluster Configuration • 4 ARM M3 cores @ 15MHz (650mV) • 1 kB instruction cache @ 60MHz (700mV) • 8 kB data cache @ 60MHz (700mV) • Cores connect via 3D to caches on the other layer • System Configuration • 2-wafer = 16 clusters (64 cores) • 4-wafer = 32 clusters (128 cores) • DDR3 controller • Estimated Performance (Raytrace) • 1.6 GOPS (0.8 GOPS on 2-wafer) • 110mW (65mW on 2-wafer) • 14.8 GOPS/W • Fair metric • Centip3De achieves 24 GOPS/W without DRAM

  30. Design Scaling and Power Breakdowns (Raytracing benchmark) • NTC Centip3De system at 130nm: • 1.9 GOPS (3.8 GOPS in boost) • Max 1 IPC per core • 128 cores • 15 MHz • 130 mW • 14.8 GOPS/W (5.5 in boost) • Scaled to 22nm: • ~600 GOPS (~1k GOPS in boost) • Max 1 IPC per core • 4,608 cores • 140 MHz • ~3W • ~200 GOPS/W

  31. Technologies for Reducing Power • Near threshold operation • 3D die stacking • Replacing DRAM with Flash memory

  32. FIN—Thanks

  33. Background – NAND Flash overview • Dual-mode SLC/MLC Flash bank organization • Single Level Cell (SLC) • 1 bit/cell • 10^5 erases per block • 25 μs read, 200 μs write • Multi Level Cell (MLC) • 2 bits/cell • 10^4 erases per block • 50 μs read, 680 μs write [Figure: dual-mode NAND Flash memory organization] • The addressable read/write unit is the page • Pages consist of 2048 bytes + 64 ‘spare’ bytes • Erases operate on 64 SLC or 128 MLC pages at a time (a block) • Technology – less than 60nm (see the sketch below)
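The SLC/MLC parameters above, gathered into one structure so the derived quantities (block size and lifetime bytes written per block) are visible at a glance; only the arithmetic is added, and every raw number comes from the slide.

```python
# The SLC/MLC parameters above gathered into one structure; only the derived
# quantities (block size, lifetime bytes written per block) are added, and
# all raw numbers come from the slide.

FLASH_MODES = {
    "SLC": dict(bits=1, erases=100_000, read_us=25, write_us=200, pages=64),
    "MLC": dict(bits=2, erases=10_000,  read_us=50, write_us=680, pages=128),
}
PAGE_BYTES = 2048     # addressable page payload (plus 64 spare bytes)

for mode, p in FLASH_MODES.items():
    block_kb = p["pages"] * PAGE_BYTES / 1024
    lifetime_gb = p["erases"] * p["pages"] * PAGE_BYTES / 2**30
    print(f"{mode} ({p['bits']} bit/cell): {block_kb:.0f} KB block, "
          f"{p['read_us']} us read / {p['write_us']} us write, "
          f"~{lifetime_gb:.0f} GB written per block over its lifetime")
```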

  34. Reducing Memory Power [Table comparing DRAM and NAND Flash; notes: NAND Flash cost assumes 2-bit-per-cell MLC, DRAM is a 2 Gbit DDR3-1333 x8 chip, Flash power numbers are for a 2 Gbit SLC x8 chip, area from the ITRS Roadmap 2009] • Flash is denser than DRAM • Flash is cheaper than DRAM • Flash is good for idle power optimization • 1000× less power than DRAM • Flash is not so good for a low-access-latency usage model • DRAM is still required for acceptable access latencies • Flash “wears out” – 10,000/100,000 (MLC/SLC) write/erase cycles

  35. A Case for Flash as Secondary Disk Cache • Many server workloads use a large working set (100s of MBs to 10s of GBs, and even more) • The large working set is cached in main memory to maintain high throughput • A large portion of DRAM goes to the disk cache • Many server applications are more read intensive than write intensive • Flash memory consumes orders of magnitude less idle power than DRAM • Use DRAM for recent and frequently accessed content and Flash for older and infrequently accessed content • Client requests follow a Zipf-like distribution, both spatially and temporally (see the sketch below) • e.g. 90% of client requests are to 20% of files
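A small illustration (ours) of the Zipf-like skew this slide appeals to; the exact split depends on the Zipf exponent and catalog size, so the printed percentage is indicative rather than the slide's 90/20 figure.

```python
# A small illustration (ours) of the Zipf-like skew this slide appeals to:
# under a Zipf popularity law, a small fraction of files absorbs most of the
# requests, which is what lets a modest DRAM tier front a larger Flash tier.

def zipf_top_fraction(n_files=10_000, top_share=0.20, s=1.0):
    """Fraction of requests going to the most popular top_share of files
    under a Zipf(s) popularity distribution."""
    weights = [1.0 / rank ** s for rank in range(1, n_files + 1)]
    top = sum(weights[: int(n_files * top_share)])
    return top / sum(weights)

print(f"Top 20% of files receive ~{zipf_top_fraction():.0%} of requests")
```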

  36. A Case for Flash as Secondary Disk Cache SPECweb99: an access latency of 100s of microseconds can be tolerated. T. Kgil and T. Mudge. FlashCache: A NAND Flash memory file cache for low power web servers. Proc. Int. Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (CASES'06), Seoul, S. Korea, Oct. 2006, pp. 103-112.

  37. Overall Architecture [Diagram, baseline without FlashCache: processors; 1GB DRAM main memory; HDD controller; hard disk drive] [Diagram, FlashCache architecture: processors; 128MB DRAM serving as generic main memory + primary disk cache and holding the tables used to manage the Flash memory (FCHT, FBST, FPST, FGST); a Flash controller with DMA and 1GB Flash as the secondary disk cache; HDD controller; hard disk drive] (a simplified read-path sketch follows below)
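A simplified sketch of the read path implied by the diagram; this is our toy model, not the FlashCache implementation, and the latencies are rough order-of-magnitude placeholders: a small DRAM disk cache is checked first, then the larger Flash disk cache, and only then the hard disk.

```python
# A simplified sketch (ours, not the FlashCache implementation) of the read
# path implied by the diagram: a small DRAM disk cache in front of a larger
# Flash disk cache, with the hard disk as the backing store. The latencies
# are rough order-of-magnitude placeholders for illustration only.

DRAM_US, FLASH_US, DISK_US = 0.1, 25.0, 5000.0   # illustrative access times

class TwoTierDiskCache:
    def __init__(self, dram_blocks=128, flash_blocks=1024):
        self.dram, self.dram_cap = {}, dram_blocks
        self.flash, self.flash_cap = {}, flash_blocks

    def read(self, block):
        if block in self.dram:                      # primary (DRAM) hit
            return DRAM_US
        if block in self.flash:                     # secondary (Flash) hit
            self._fill(self.dram, self.dram_cap, block)
            return FLASH_US
        # Miss in both tiers: fetch from disk, install in both caches.
        self._fill(self.flash, self.flash_cap, block)
        self._fill(self.dram, self.dram_cap, block)
        return DISK_US

    def _fill(self, cache, cap, block):
        if len(cache) >= cap:                       # naive FIFO eviction
            cache.pop(next(iter(cache)))
        cache[block] = True

cache = TwoTierDiskCache()
latencies = [cache.read(b) for b in [1, 2, 1, 3, 1, 2]]
print("access latencies (us):", latencies)
```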

  38. Overall Network Performance [Chart: overall network performance in Mbps, SPECweb99] 128MB DRAM + 1GB NAND Flash performs as well as 1GB DRAM while requiring only about 1/3 of the die area (SLC Flash assumed)

  39. Overall Main Memory Power [Chart, SPECweb99: main memory power for the configurations compared: 2.5W, 1.6W, and 0.6W] Flash memory consumes much less idle power than DRAM

  40. Concluding Remarks on Flash-for-DRAM • DRAM clockless refresh reduces idle power • Flash density continues to grow • The Intel-Micron JV announced 25nm flash • an 8GB die of 167 sq mm with 2 bits per cell • 3 bits/cell is coming soon • PCRAM appears to be an interesting future alternative • I predict single-level storage using some form of NV memory, with disks replacing tape for archival storage

  41. Cluster Size and Boosting • Analysis: • 2-die stack • Fixed die size • Fixed amount of cache per core • Raytrace algorithm • Boosting the 4-core version achieves 81% more GOPS than the 1-core version (which has a larger cache) • 4-core clusters are 27% more energy efficient than 1-core clusters

  42. System Architecture [Diagram: two logic layers (Layer A and Layer B) holding sixteen 4-core clusters attached to layer hubs; also shown: system control, clock generation and distribution, system communication forwarding, memory forwarding, JTAG, B-B interfaces, and DRAM channels]

  43. Cluster Architecture [Diagram, spanning Layer 1 and Layer 2: four M3 cores; a 4-way 1024b cluster I-cache and a 4-way 8192b cluster D-cache on 60 MHz / 65 MHz cache clocks; a cluster clock generator producing 15 MHz core clocks with 0, 90, 180, and 270 degree phase offsets; AMBA-like 32-bit buses crossing the 3D integration boundary and an AMBA-like 128-bit bus to DRAM; a 60 MHz system clock; cluster MMIO, reset control, etc.; system communication; JTAG in/out]

  44. With power and cooling becoming an increasingly costly part of the operating cost of a server, the old trend of striving for higher performance with little regard for power is over. Emerging semiconductor process technologies, multicore architectures, and new interconnect technology provide an avenue for future servers to become low power, compact, and possibly mobile. In our talk we examine three techniques for achieving low power: 1) near threshold operation; 2) 3D die stacking; and 3) replacing DRAM with Flash memory.

  45. Solutions? • Reduce Vdd • “Near Threshold Computing” • Drawbacks • slower, less reliable operation • But • parallelism • suits some computations (more as time goes by) • robustness techniques • e.g. Razor—in situ monitoring • Cool chips (again) • interesting developments in microfluidics • Devices that operate at lower Vdd without performance loss

  46. Low-Voltage Robustness • VDD scaling reduces SRAM robustness • Maintain robustness through device sizing and VTH selection • Robustness is measured using importance sampling (see the sketch below) • In the NTC range the 6T cell is smaller
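A generic illustration of the importance-sampling idea mentioned above, not the authors' SRAM failure model: bitcell failures are so rare that plain Monte Carlo almost never observes one, so samples are drawn from a distribution shifted toward failure and reweighted by the likelihood ratio.

```python
# A generic illustration (not the authors' SRAM model) of why importance
# sampling is used for robustness analysis: failures are so rare that plain
# Monte Carlo almost never sees one, so samples are drawn from a shifted
# distribution and each failing sample is reweighted by the likelihood ratio.

import math, random

def estimate_failure_prob(n=100_000, mean=5.0, sigma=1.0, shift=5.0, seed=1):
    """Estimate P(margin < 0) for margin ~ N(mean, sigma^2) by sampling from
    N(mean - shift, sigma^2) and reweighting the failing samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mean - shift, sigma)        # biased toward failure
        if x < 0:
            # likelihood ratio: true density / sampling density
            w = math.exp((-(x - mean) ** 2 + (x - (mean - shift)) ** 2)
                         / (2 * sigma ** 2))
            total += w
    return total / n

print(f"estimated failure probability: {estimate_failure_prob():.2e}")
```

For a 5-sigma margin the true probability is about 2.9e-7; 100,000 plain Monte Carlo samples would usually report zero failures, while the reweighted estimate typically lands within a few percent of the true value.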

  47. SRAM Designs • HD-SRAM • Differential write • Single-ended read • Asymmetric sizing • HD µ/σ = 12.1 / 1.16 • 6T µ/σ = 11.0 / 1.35 • Crosshairs • Skew VDD in column • Skew GND in row • Target failing cells • No bitcell changes • Skew hurts some cells

  48. Evolution of Subthreshold Designs • Subliminal 1 design (2006) • 0.13 µm CMOS • used to investigate the existence of Vmin • 2.60 µW/MHz • Subliminal 2 design (2007) • 0.13 µm CMOS • used to investigate process variation • 3.5 µW/MHz • Phoenix 1 design (2008) • 0.18 µm CMOS • used to investigate sleep current • 2.8 µW/MHz • Phoenix 2 design (2010) • 0.18 µm CMOS • commercial ARM M3 core • used to investigate energy harvesting and power management • 37.4 µW/MHz • Unpublished results – ISSCC 2010 – do not disclose
