The Yin and Yang of Hardware Heterogeneity: Can Software Survive? Kathryn S McKinley. Computation: Turing, 1936. The Transistor: Shockley, Bardeen, Brattain, 1947. Virtuous Cycle: doubling of transistors (faster, smaller, cheaper, …); software innovation.
The Yin and Yang of Hardware Heterogeneity: Can Software Survive? Kathryn S McKinley
Virtuous Cycle • Doubling of transistors: faster, smaller, cheaper, … • Software Innovation • Device Innovation • Software Complexity • Hardware Complexity • Sequential Interface
Dennard Scaling is over • Power = Clock Speed × Voltage² • [Plot axes: Performance, Power]
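A minimal sketch of why the end of Dennard scaling matters, assuming the standard dynamic-power model P ∝ C × V² × f (the slide's formula with capacitance made explicit). The function name and the scaling factor `k` are illustrative assumptions, not taken from the talk.

```python
def power_density_ratio(k: float, voltage_scales: bool) -> float:
    """Relative power density after shrinking feature size by a factor k.

    Dynamic power per transistor is modeled as C * V^2 * f.
    Classic Dennard scaling: C -> C/k, V -> V/k, f -> f*k, area -> area/k^2.
    Post-Dennard assumption used here: V stays fixed, everything else scales.
    """
    capacitance = 1.0 / k
    voltage = 1.0 / k if voltage_scales else 1.0
    frequency = k
    area = 1.0 / k**2
    power_per_transistor = capacitance * voltage**2 * frequency
    return power_per_transistor / area


if __name__ == "__main__":
    k = 2.0  # hypothetical shrink factor
    print(power_density_ratio(k, voltage_scales=True))   # 1.0
    print(power_density_ratio(k, voltage_scales=False))  # 4.0
```

Under these assumptions, power density stays constant while voltage scales with feature size, and grows as k² once voltage stops scaling.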
Electricity • Dark silicon • Electricity costs in U.S. data centers: $4.5 billion (2006), $7.4 billion (2011) [U.S. EPA 2007] • Battery life [Goulding et al. Hot Chips 2010]
Multicore Hardware (each die shown at correct scale)
• 2003 Pentium 4 (130nm, Northwood): 55M transistors, 131mm², 1 core, 2-way SMT
• 2006 Core 2 Duo (65nm, Conroe): 291M transistors, 143mm², 2 cores, no SMT
• Core 2 Quad (65nm, Kentsfield)
• 2008 i7 (45nm, Bloomfield): 731M transistors, 263mm², 4 cores, 2-way SMT
• 2008 Atom (45nm, Diamondville): 47M transistors, 36mm², 1 core, 2-way SMT
• 2009 Core 2 Duo (45nm, Wolfdale): 228M transistors, 82mm², 2 cores, no SMT
• 2009 Atom D (45nm, Pineview): 176M transistors, 87mm², 2 cores + GPU, 2-way SMT
• 2010 i5 (32nm, Clarkdale): 382M transistors, 81mm², 2 cores, 2-way SMT
Virtuous Cycle: End of Dennard Scaling • Doubling of transistors: faster, smaller, cheaper, … • Software Innovation • Device Innovation • Software Complexity • Hardware Complexity • Parallel Interface • Sequential Interface
App Writers • Software People • Computer Scientists
Languages People Use • C • C++ • PHP
Software • Fast enough • Performance • Productivity • Managing complexity • Abstractions
Workloads: 61 benchmarks, native and managed, non-scalable and scalable
• Native Non-scalable: 27 benchmarks; C, C++, Fortran; SPEC CPU2006; weight 0.25
• Java Non-scalable: 18 benchmarks; Java; SPEC jvm98, DaCapo, pjbb2005; weight 0.25
• Native Scalable: 11 benchmarks; C, C++; PARSEC; weight 0.25
• Java Scalable: 5 benchmarks; Java; DaCapo; weight 0.25
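One plausible reading of the 0.25 values is that each of the four groups contributes equally to an aggregate score, regardless of how many benchmarks it contains. The sketch below illustrates that reading only; the group keys, the function name, and the sample numbers are hypothetical.

```python
from statistics import mean

# Assumed equal group weights, as the four 0.25 values on the slide suggest.
GROUP_WEIGHTS = {
    "native_non_scalable": 0.25,
    "java_non_scalable": 0.25,
    "native_scalable": 0.25,
    "java_scalable": 0.25,
}


def weighted_score(results: dict[str, list[float]]) -> float:
    """Average within each group first, then combine the groups by weight."""
    if set(results) != set(GROUP_WEIGHTS):
        raise ValueError("results must contain exactly the four groups")
    return sum(GROUP_WEIGHTS[g] * mean(vals) for g, vals in results.items())


if __name__ == "__main__":
    # Made-up per-benchmark scores, for illustration only.
    sample = {
        "native_non_scalable": [1.0, 3.0],
        "java_non_scalable": [2.0],
        "native_scalable": [4.0],
        "java_scalable": [2.0],
    }
    print(weighted_score(sample))  # 2.5
```

Averaging per group first keeps the 27 native non-scalable benchmarks from dominating the 5 scalable Java ones; a flat mean over the same sample would give 2.4 instead.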
Power vs Performance: power is benchmark dependent • [Plot: Pentium 4 (130), Core 2 Duo (65), Core 2 Duo (45), i7 (45), i5 (32)]
Pareto Analysis (45nm): workload determines the energy-efficient architecture
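A Pareto analysis of this kind keeps only the configurations that no other configuration beats on both performance and energy. The sketch below shows the general technique, not the talk's actual data; the `Config` type and all numbers are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    performance: float  # higher is better
    energy: float       # lower is better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Return the configurations not dominated by any other configuration."""

    def dominated(c: Config) -> bool:
        # o dominates c if it is at least as good on both axes
        # and strictly better on at least one.
        return any(
            o.performance >= c.performance
            and o.energy <= c.energy
            and (o.performance > c.performance or o.energy < c.energy)
            for o in configs
        )

    return [c for c in configs if not dominated(c)]


if __name__ == "__main__":
    # Made-up points for illustration only.
    points = [
        Config("A", 1.0, 1.0),
        Config("B", 2.0, 1.5),
        Config("C", 1.5, 2.0),  # dominated by B
        Config("D", 3.0, 3.0),
    ]
    print([c.name for c in pareto_frontier(points)])  # ['A', 'B', 'D']
```

Running the same function on per-workload measurements would give a different frontier for each workload, which is the sense in which the workload determines the efficient architecture.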
Parallelism & Heterogeneity • [Figure: big cores, small cores, custom]
Motivation: Single ISA Heterogeneity • NVIDIA Tegra3: 5 Cortex A9 (4 × 1.4 GHz, 1 × 500 MHz) • Texas Instruments OMAP5432: 2 Cortex A15 + 2 Cortex M4
Heterogeneous Hardware • Energy Efficiency • Complexity
Heterogeneous parallel hardware + software? • [Photo: train wreck at Laurel, Maryland, July 31, 1922; Washington Post, August 1, 1922]
Exploiting Heterogeneity • Parallelism • Ubiquity • Differentiation
Case Studies • Mobile UI [UIST’13] • Managed runtime [ISCA’12] • Always awake [NSDI’12] • Interactive cloud [ICAC’13]
User Interface I/O on OMAP big/little [Kihm, Guimbretière UIST’13] • Keyboard • Scrolling • Inking • [Figure: App; A9 big cores; M3 little cores]
User Interface I/O & Heterogeneity. UI I/O Characteristics: Parallelism • Ubiquity • Differentiated
A9 + M3 Heterogeneity: battery life increase
VM Services on little cores [Cao et al., ISCA’12] • VM Services: GC + JIT • [Figure: App on big cores; VM services on little cores]
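The placement idea on this slide can be sketched as a simple policy: VM service threads (GC and JIT) go to little cores, application threads go to big cores. This is an illustrative sketch only, not the mechanism from Cao et al.; the core ids, the `Thread` type, and the round-robin choice are all assumptions.

```python
from dataclasses import dataclass

BIG_CORES = [0, 1]     # hypothetical core ids
LITTLE_CORES = [2, 3]  # hypothetical core ids


@dataclass(frozen=True)
class Thread:
    name: str
    kind: str  # "app", "gc", or "jit"


def place(threads: list[Thread]) -> dict[str, int]:
    """Map each thread name to a core id.

    GC and JIT threads are spread round-robin over the little cores;
    everything else is spread round-robin over the big cores.
    """
    placement: dict[str, int] = {}
    next_big = 0
    next_little = 0
    for t in threads:
        if t.kind in ("gc", "jit"):
            placement[t.name] = LITTLE_CORES[next_little % len(LITTLE_CORES)]
            next_little += 1
        else:
            placement[t.name] = BIG_CORES[next_big % len(BIG_CORES)]
            next_big += 1
    return placement


if __name__ == "__main__":
    threads = [
        Thread("main", "app"),
        Thread("gc0", "gc"),
        Thread("jit0", "jit"),
        Thread("worker", "app"),
    ]
    print(place(threads))
```

A real runtime would apply such a mapping through the operating system's thread-affinity interface; that step is omitted here.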
VM Services & Heterogeneity. VM Services Characteristics: Parallelism • Ubiquity • Differentiated
[Figure: measured (filled) vs. model (empty) results; "better" direction marked on each panel. Configurations: 2.8 GHz AMD + 2.8 GHz AMD | 2.8 GHz AMD + 1.66 GHz Atom]
Somniloquy: application stubs on little cores wake up the big core as needed [Agarwal et al., NSDI’12]
• Laptop host, big core (RAM, peripherals, …): Apps, Somniloquy daemon, OS, Network stack
• Little core (CPU + DRAM + Flash) with network interface: Wakeup filters, App stubs, Embedded OS, Network stack
• Applications: filtering, notifications, downloads, keep alive
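The wake-up filtering described above can be sketched as a small decision function running on the little core: some traffic wakes the big core, some is answered by an application stub, and the rest is ignored. This is a hedged illustration of the idea, not Somniloquy's actual code; the `Packet` type, the filter tables, and the port choices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    protocol: str
    dst_port: int


# Hypothetical filter tables; a real deployment would configure these.
WAKE_FILTERS = {("tcp", 22), ("tcp", 3389)}  # traffic that needs the host
STUB_FILTERS = {("tcp", 5222)}               # traffic a stub can answer


def handle_on_little_core(pkt: Packet) -> str:
    """Decide what the little core does with an incoming packet."""
    key = (pkt.protocol, pkt.dst_port)
    if key in WAKE_FILTERS:
        return "wake_big_core"
    if key in STUB_FILTERS:
        return "handle_in_stub"
    return "ignore"


if __name__ == "__main__":
    for pkt in (Packet("tcp", 22), Packet("tcp", 5222), Packet("udp", 9999)):
        print(pkt, "->", handle_on_little_core(pkt))
```

The energy benefit comes from the common case: most packets end in `handle_in_stub` or `ignore`, so the big core stays asleep.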
Case Studies ✔ One Time Improvements • Mobile UI [UIST’13] • Managed runtime [ISCA’12] • Always awake [NSDI’12] • Interactive cloud [ICAC’13]
Interactive Cloud Services: Bing, Finance, Recommendations, Games [Ren et al. ICAC’13]. Interactive Services Characteristics: Parallelism • Ubiquity • Differentiated
Interactive Services: Workload Characterization of Bing Search • Completion vs Quality [Plot: Quality vs Completion Ratio] • Responsiveness: deadline ~100ms = interactive • Partial execution trades quality for responsiveness
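Partial execution can be sketched as processing work items until a time budget runs out and then returning whatever has been completed. The sketch below uses simulated per-item costs rather than real timing so that it is deterministic; the function name and the cost values are assumptions, and the mapping from completion ratio to result quality is workload specific and not modeled here.

```python
def partial_execute(costs_ms: list[float], deadline_ms: float = 100.0):
    """Process items in order until the next one would miss the deadline.

    Returns (items_completed, completion_ratio).
    """
    elapsed = 0.0
    done = 0
    for cost in costs_ms:
        if elapsed + cost > deadline_ms:
            break  # stop early: respond on time with a partial result
        elapsed += cost
        done += 1
    ratio = done / len(costs_ms) if costs_ms else 1.0
    return done, ratio


if __name__ == "__main__":
    # Four hypothetical work items of 30 ms each against a 100 ms deadline.
    print(partial_execute([30.0, 30.0, 30.0, 30.0]))  # (3, 0.75)
```

Here the request answers within the deadline at a completion ratio of 0.75 instead of finishing all four items late.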