
Advanced Microarchitecture

Advanced Microarchitecture. Lecture 1: Introduction. Course Floorplan. Intro/Review: 2 lectures Processor Front-end: 5 lectures Execution Core: 4 lectures Other topics: 6 lectures Processor Case Studies: 11 classes Mini-conference: 2 classes. First 8.5 weeks. Next 5.5 weeks.


Presentation Transcript


  1. Advanced Microarchitecture Lecture 1: Introduction

  2. Course Floorplan • Intro/Review: 2 lectures • Processor Front-end: 5 lectures • Execution Core: 4 lectures • Other topics: 6 lectures (these lectures fill the first 8.5 weeks) • Processor Case Studies: 11 classes (next 5.5 weeks) • Mini-conference: 2 classes (last week of class) • Course philosophy: (1) First half: learn details about microarchitecture concepts (2) Second half: study real designs, applying what we covered in part 1.

  3. Course Components • Lectures: • I’m not taking attendance, but since there’s no textbook, attendance (and being awake) is incredibly important. • There will be four homework assignments for this part. • Supplemental Reading (required): • “The Pentium Chronicles” by Robert P. Colwell, published by Wiley-Interscience, ISBN: 0-471-73617-1 • Must complete reading this before the start of case studies • Case studies: • Paper reading is mandatory… you cannot participate if you haven’t read the paper(s)

  4. Course Components (2) • Term Project • Microprocessor/microarchitecture-based project • Project must be approved • Mini-Conference • We will peer-review all projects, similar to how a conference program committee reviews papers • Last week of class will be used to hold a mini-conference where you present your term project • Food and drink will be provided! :-) • No Exams, Hooray!

  5. Grading Specifics • 4 Homeworks at 5 points each = 20 pts • 5 TPC reading summaries, 3 pts each = 15 pts • 11 case-study reading summaries and participation, 3 pts each = 33 pts • Term project = 32 pts • Abstract/Proposal: 5 pts • Mid-project Status: 2 pts • Write-up: 10 pts • Reviews (of other people’s projects): 5 pts • Final Presentation: 10 pts

  6. Case-Study Mechanics • If you don’t do the readings, you’re not going to contribute anything to the discussions, therefore … • For each case-study session, you must do the reading before the start of class • You must also write a brief summary of the readings • You must submit the summary at the start of class… The summary is your entrance ticket to class: If you don’t hand in the summary, I’m not going to let you enter the classroom!

  7. Performance • What metric to use? • CPI, IPC, MIPS, FLOPS, polygons/sec, frames/sec, … • Absolute Runtime • “How long will it take to run my program?” • “How long will it take to run my programs?” • Relative Performance • “Will my program run faster on an Intel or AMD CPU?” • “Will my programs run faster on an Intel or AMD CPU?” • “Will my typical program run faster on Intel or AMD?”

  8. Iron Law of Performance • \( \text{Runtime} = \text{Total Insts} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \) • Total Insts: total work in program; set by algorithms, compilers, ISA extensions • Cycles/Instruction: CPI or 1/IPC; set by microarchitecture • Seconds/Cycle: 1/f (clock freq.); set by microarchitecture, process tech • This is the only performance metric that matters (for the uniprocessor world). Everything else is just a proxy!!!
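A minimal sketch of the Iron Law as a function may help make the three factors concrete. The instruction count, CPI values, and clock frequency below are hypothetical numbers chosen for illustration, not figures from the lecture.

```python
def runtime_seconds(instructions: float, cpi: float, clock_hz: float) -> float:
    """Iron Law: runtime = instructions x (cycles/instruction) x (seconds/cycle)."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle


# Hypothetical program: 1 billion instructions on a 2 GHz clock.
baseline = runtime_seconds(1e9, cpi=1.5, clock_hz=2e9)
# Same program and clock, but a microarchitecture that reaches CPI 1.0.
improved = runtime_seconds(1e9, cpi=1.0, clock_hz=2e9)

print(f"baseline: {baseline:.2f} s, improved: {improved:.2f} s")
```

Each argument maps to one term of the product, so a change in any one of them (fewer instructions, lower CPI, faster clock) can be evaluated in isolation.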

  9. Multi-Core/Performance • Correct metric depends on the workload • Single parallel (multi-threaded) application: • Runtime • Multiple applications (multi-programmed workload): • Typically total system throughput • Latency/Runtime of a given program not so important • Fairness and combined fairness/performance metrics often used.

  10. Power • Which power do you mean? • Maximum/peak power delivery requirements • “450W Power Supply” • Average power delivery requirements • Battery life • Electricity bills

  11. Dynamic Power • Power to charge/discharge a capacitor • \( P = VI \) • \( I = C \, \frac{dV}{dt} \) [Figure: capacitor C with voltage V across it]

  12. Dynamic Power • \( P = \tfrac{1}{2} C V^2 f a \) • C: total capacitance switched • V: power supply voltage • f: clock frequency • a: activity factor • Really, \( P = \sum_{i \in \text{all blocks}} P_i = \tfrac{1}{2} f V^2 \times \sum_{i \in \text{all blocks}} C_i a_i \) • \( C_i \) and \( a_i \) are hard to determine • \( C_i \) requires detailed circuit design • \( a_i \) depends on dynamic behavior (application specific)

  13. Example • Cache Power • Clock frequency = 2 GHz • L1 Instruction Cache: C = 1.515 nF, a = 0.88 • L1 Data Cache: C = 0.741 nF, a = 0.6 • L2 Unified Cache: C = 12.7 nF, a = 0.07 • Vdd = 1.5 V • \( P_{IL1} \) = ½ × 1.515 nF × (1.5 V)² × 2 GHz × 0.88 = ½ × 1.515e-9 F × 2.25 V² × (1 / 500e-12 sec) × 0.88 = 3 F·V²/s = 3 (coulombs/volt) × (volt²)/second = 3 coulomb·volt/second = 3 (Amp·sec) × (Watt/Amp) / sec = 3 Watts

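  14. Example • L1 Data Cache: C = 0.741 nF, a = 0.6 • \( P_{DL1} \) = ½ × 0.741 nF × (1.5 V)² × 2 GHz × 0.6 ≈ 1 Watt • L2 Unified Cache: C = 12.7 nF, a = 0.07 • \( P_{UL2} \) = ½ × 12.7 nF × (1.5 V)² × 2 GHz × 0.07 ≈ 2 Watts • Total Power of All Caches = \( P_{IL1} + P_{DL1} + P_{UL2} \) ≈ 3 + 1 + 2 = 6 Watts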
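The cache example can be checked with a short script. This is only a sketch: it takes the capacitances in nanofarads, which is what the worked calculation's 1.515e-9 F implies, and the function and dictionary names are illustrative rather than anything defined in the lecture.

```python
def dynamic_power(c_farads: float, vdd: float, freq_hz: float, activity: float) -> float:
    """Dynamic power P = 1/2 * C * V^2 * f * a, in watts."""
    return 0.5 * c_farads * vdd ** 2 * freq_hz * activity


VDD = 1.5    # volts
FREQ = 2e9   # 2 GHz

# (capacitance in farads, activity factor) per cache; nF values assumed.
caches = {
    "IL1": (1.515e-9, 0.88),
    "DL1": (0.741e-9, 0.60),
    "UL2": (12.7e-9, 0.07),
}

powers = {name: dynamic_power(c, VDD, FREQ, a) for name, (c, a) in caches.items()}
total = sum(powers.values())

for name, watts in powers.items():
    print(f"{name}: {watts:.2f} W")
print(f"total: {total:.2f} W")
```

Note how the large L2 burns less power than the small instruction cache because its activity factor is so much lower.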

  15. Trading Power and Performance • \( P = \tfrac{1}{2} C V^2 f a \) • \( f \propto V \) • \( P \propto \tfrac{1}{2} C V^2 V a \) • \( P \propto V^3 \) • \( \text{Perf} \propto f \propto V \) • Decrease V • Performance drops linearly • Power drops cubically! • A.K.A. Voltage-Frequency Scaling • Rule of thumb: 3% power reduction corresponds to about a 1% performance drop • Voltage can be decreased only so far... after that, you can only decrease clock frequency
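The 3%-for-1% rule of thumb follows directly from the cubic relationship, as the sketch below illustrates. It assumes the idealized proportionalities from the slide (performance ∝ f ∝ V and power ∝ V³) and ignores the voltage floor mentioned above; the function name is illustrative.

```python
def scale_voltage(voltage_ratio: float) -> tuple[float, float]:
    """Return (relative performance, relative power) after scaling V and f together."""
    performance = voltage_ratio      # Perf is proportional to f, which is proportional to V
    power = voltage_ratio ** 3       # P is proportional to V^3
    return performance, power


perf, power = scale_voltage(0.99)    # lower the supply voltage by 1%
perf_drop_pct = (1.0 - perf) * 100
power_drop_pct = (1.0 - power) * 100

print(f"performance drop: {perf_drop_pct:.2f}%, power drop: {power_drop_pct:.2f}%")
```

For small changes the power saving is roughly three times the performance loss; for larger voltage reductions the ratio drifts below 3 because the cubic is no longer well approximated by its linear term.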

  16. Static Power • “Leakage”, “Dark Current” • Dark current name comes from current measured in photodetectors when no light is present • Two Kinds: • Channel leakage or subthreshold conductance • Gate leakage

  17. First, a MOS transistor [Figure: MOS transistor diagram; labels: Source, Gate, Drain, Current, Applied Voltage, Threshold Voltage]

  18. NMOS vs. PMOS • P = positive, N = negative [Figure: NMOS and PMOS transistors, each with Source, Gate, and Drain; labels: Vdd, Gate = 0 Volts]

  19. Back to Leakage [Figure: transistor annotated with Gate Leakage and Channel Leakage (Subthreshold Conductance)]

  20. Leakage in MOS transistors • Gate leakage: \( I_{ox} = K_2 W \left( \frac{V}{T_{ox}} \right)^2 e^{-a T_{ox} / V} \) • Oxide thickness keeps shrinking (faster transistors) • Probability of quantum tunneling increases (leakage increases) • Channel leakage: \( I_{sub} = K_1 W e^{-V_{th} / (n V_q)} \left( 1 - e^{-V / V_q} \right) \) • Channel length keeps shrinking (faster transistors) • Channel resistance decreases (leakage increases)
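To see the trends in the two leakage formulas, they can be evaluated numerically. The sketch below uses placeholder constants (K1, K2, W, a, and n all default to arbitrary values) and assumes V_q is the thermal voltage, roughly 0.026 V at room temperature; none of these values come from the lecture, so only the direction of change is meaningful, not the magnitudes.

```python
import math


def gate_leakage(v: float, t_ox: float, width: float = 1.0,
                 k2: float = 1.0, a: float = 1.0) -> float:
    """I_ox = K2 * W * (V / T_ox)^2 * exp(-a * T_ox / V), arbitrary units."""
    return k2 * width * (v / t_ox) ** 2 * math.exp(-a * t_ox / v)


def subthreshold_leakage(v: float, v_th: float, v_q: float = 0.026,
                         width: float = 1.0, k1: float = 1.0, n: float = 1.5) -> float:
    """I_sub = K1 * W * exp(-V_th / (n * V_q)) * (1 - exp(-V / V_q)), arbitrary units."""
    return k1 * width * math.exp(-v_th / (n * v_q)) * (1.0 - math.exp(-v / v_q))


# Thinner oxide (smaller T_ox, arbitrary units) gives more gate leakage.
thick_oxide = gate_leakage(v=1.5, t_ox=2.0)
thin_oxide = gate_leakage(v=1.5, t_ox=1.0)

# In this formula, a lower threshold voltage gives more subthreshold leakage.
high_vth = subthreshold_leakage(v=1.5, v_th=0.4)
low_vth = subthreshold_leakage(v=1.5, v_th=0.2)

print(f"gate leakage ratio (thin/thick oxide): {thin_oxide / thick_oxide:.1f}x")
print(f"subthreshold leakage ratio (low/high Vth): {low_vth / high_vth:.1f}x")
```

Both terms are exponential in a device parameter, which is why modest shrinks can change leakage by large factors.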

  21. Quantum Tunneling • Electrons aren’t “here” or “there” • Location is a probability distribution • Non-zero probability of being anywhere [Figure: electrons (e-) next to an oxide layer in two cases, labeled P(Tunnel) << 1 and P(Tunnel) non-negligible]

  22. Power vs. Performance • ED product (energy * delay) • Lower is better • Lower execution latency (i.e., higher performance) • Lower energy consumption • Can lead to not-so-great configurations • Simple CPU → really long execution time, but very low power → lower ED product (may not be acceptable) • ED² product • Performance more heavily weighted
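A small numeric comparison shows how the two metrics can disagree. The two CPU configurations below are hypothetical (energy in joules, delay in seconds) and were picked only so that the ED product favors the slow, simple design while the ED² product favors the fast one.

```python
def ed_product(energy: float, delay: float) -> float:
    """Energy-delay product: lower is better."""
    return energy * delay


def ed2_product(energy: float, delay: float) -> float:
    """Energy-delay-squared product: weights delay (performance) more heavily."""
    return energy * delay ** 2


# Hypothetical designs as (energy in J, delay in s).
designs = {
    "simple": (1.0, 8.0),   # very low energy, really long execution time
    "fast": (5.0, 2.0),     # more energy, much shorter execution time
}

best_by_ed = min(designs, key=lambda name: ed_product(*designs[name]))
best_by_ed2 = min(designs, key=lambda name: ed2_product(*designs[name]))

print(f"best by ED:  {best_by_ed}")
print(f"best by ED2: {best_by_ed2}")
```

With these numbers the simple design wins on ED (8 vs. 10) but loses on ED² (64 vs. 20), which is the "not-so-great configuration" risk the slide describes.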

  23. Thermals • Temperature of the chip determined by • Power/heat generation rate • Heat removal • Given the two, T will settle at a steady state • Heat flow is function of temperature gradient • If there’s too much heat, T will increase until gradient large enough to remove the heat fast enough • So long as this steady state T is within allowed operating conditions, everything should work fine • May have impact on long-term reliability

  24. Thermal Runaway • But, leakage is a function of temperature: \( I_{sub} = K_1 W e^{-V_{th} / (n V_q)} \left( 1 - e^{-V / V_q} \right) \) • ↑ Temp leads to ↑ Leakage • Which burns more power • Which leads to ↑ Temp, which … • Positive feedback loop can melt your chip
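The feedback loop can be illustrated with a toy model. Everything here is an assumption made for illustration: the chip is reduced to a single thermal resistance, leakage power is modeled as growing exponentially with temperature, and all constants (50 W dynamic power, 10 W reference leakage, the 150 °C limit) are invented. It is a sketch of the qualitative behavior, not a thermal simulator.

```python
import math


def settle_temperature(r_thermal: float, p_dynamic: float = 50.0,
                       p_leak_ref: float = 10.0, t_ambient: float = 40.0,
                       t_scale: float = 30.0, t_limit: float = 150.0,
                       max_steps: int = 1000, tol: float = 1e-6):
    """Iterate T -> T_ambient + R * (P_dyn + P_leak(T)) until it settles.

    Returns the steady-state temperature in deg C, or None if the temperature
    exceeds t_limit (thermal runaway) or fails to settle within max_steps.
    """
    t = t_ambient
    for _ in range(max_steps):
        # Toy leakage model: grows exponentially with temperature.
        p_leak = p_leak_ref * math.exp((t - t_ambient) / t_scale)
        t_next = t_ambient + r_thermal * (p_dynamic + p_leak)
        if t_next > t_limit:
            return None               # heat removal cannot keep up
        if abs(t_next - t) < tol:
            return t_next             # steady state reached
        t = t_next
    return None


good_cooling = settle_temperature(r_thermal=0.3)   # deg C per watt
poor_cooling = settle_temperature(r_thermal=1.0)

print("good cooling:", good_cooling)
print("poor cooling:", poor_cooling)
```

With the lower thermal resistance the loop converges to a steady temperature; with the higher one each step adds more leakage than the cooling can remove, and the model reports runaway.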

  25. Hot Spots • Average temperature != local temperature • Local spots may be hotter • Leads to “hot spots” • Temp anywhere cannot exceed Tjmax (transistors stop working) • Possible to have good average/global temp but still violate Tjmax locally [Figure: Simulated P4 Thermals]

  26. When Cooling is Insufficient

  27. Coupling Noise • Capacitive Coupling • Inductive Coupling [Figure: Wire 1 and Wire 2 shown for each case; labels: current change, magnetic field, induced current]

  28. Impact on Performance • Extra noise margin → decrease in f [Figure: two timelines labeled clock cycle time]

  29. Power Supply Noise [Figure: water tank analogy; labels: Water Tank, Ishower, Ijohn, Flush!, Pressure Drop, Ishower - Ijohn]

  30. Power Supply Noise • Local spikes in power consumption can affect other very far away blocks depending on the power distribution network [Figure: power distribution network fed from a power supply pin; most nodes labeled 1.5V, one labeled 1.2V]

  31. Decoupling or Debouncing Capacitors (“Decaps”) • Same solution as water supply [Figure: circuit with decoupling capacitors; labels: 1.5V, 0.75V, 0.5 mA, 1 mA, 2 mA, up to 3 mA]

  32. Fabrication Costs • CPU (die) size greatly affects cost • Current CPUs 1-2 cm² • Embedded much smaller • cost and footprint matter in cell phone or iPod [Figure: silicon wafer divided into dies]

  33. Manufacturing Defects • Yield • 13/16 working chips: 81.25% yield • 1/4 working chips: 25.0% yield [Figure: two wafers with the same defects marked, one cut into 16 small dies and one into 4 large dies]

  34. Yield • Assuming $250 per wafer: • 52 die, 81.25% yield → 42.25 working parts / wafer → $5.92 per die • 17 die, 25.0% yield → 4.25 working parts / wafer → $58.82 per die
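The per-die costs above come from dividing the wafer cost by the number of working parts. A minimal sketch of that arithmetic, using the slide's numbers ($250 per wafer, 52 or 17 die per wafer) with an illustrative function name:

```python
def cost_per_working_die(wafer_cost: float, dies_per_wafer: int, yield_fraction: float) -> float:
    """Wafer cost spread over only the dies that actually work."""
    working_parts = dies_per_wafer * yield_fraction
    return wafer_cost / working_parts


WAFER_COST = 250.0  # dollars, as assumed on the slide

small_die = cost_per_working_die(WAFER_COST, dies_per_wafer=52, yield_fraction=0.8125)
large_die = cost_per_working_die(WAFER_COST, dies_per_wafer=17, yield_fraction=0.25)

print(f"small die: ${small_die:.2f}")
print(f"large die: ${large_die:.2f}")
print(f"cost ratio: {large_die / small_die:.1f}x")
```

The larger die is penalized twice: fewer candidates fit on the wafer, and each candidate is more likely to contain a defect.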

  35. Yield • As technology matures, yield typically improves, which helps to reduce cost. • 20” Display: $600 • 30” Display: 1.5² = 2.25x area, $1800 (3x $$$) • In 2009: $400? • Yield applies to all sorts of fabrication technologies, not just plain old silicon. • Prices from apple.com as of 11/26/2007

  36. Complexity • Design time (microarchitecture) • Implementation time (circuit, layout engineers) • Validation/Verification (test before fab) • Debugging (test after fab) • Repeat… • Impacts Time-to-Market • 2x performance / 18 months = 0.893% performance / week • Each week of product delay had better earn you at least 0.9% performance!
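The weekly figure is the compound growth rate that doubles performance over 18 months. A short sketch of the calculation, assuming 52 weeks per year so that 18 months is 78 weeks:

```python
DOUBLING_MONTHS = 18
WEEKS_PER_YEAR = 52  # assumption: 18 months is treated as exactly 78 weeks

weeks = DOUBLING_MONTHS * WEEKS_PER_YEAR / 12
# Solve (1 + g)^weeks = 2 for the weekly growth rate g.
weekly_gain = 2 ** (1 / weeks) - 1

print(f"{weeks:.0f} weeks per doubling")
print(f"required gain: {weekly_gain * 100:.3f}% per week")
```

Compounding matters here: simply dividing 100% by 78 weeks would overstate the required weekly gain at about 1.28%.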

  37. Verification • Intel Pentium FDIV bug • Verification/validation should catch this • It didn’t (last minute optimization, full validation not run) • Cost: ~ $500M • Complexity can be costly • Over half of the design effort is spent on verification

  38. OS, Compilers, Applications, … • Some additional direct and indirect costs • Ex. MMX/SSE • Costs extra HW, design time, verification, etc. • Useless without cooperation from application writers • Intel has a lot of SW people in-house to work on new applications, or work with 3rd-parties to use new technologies in their applications • Danger: benefits on new computers, but compatibility issues with older computers • Ex. Multi-Core • Need support from OS vendors and application writers, otherwise no one can use the extra processors • Some of the cost shared by others; worthwhile investment for MSFT if you have to buy Vista for full multi-core support

  39. Goal of Processor Design • Maximize performance ... Within the constraints of • Peak power, average power, die area, metal layers, thermals, implementation complexity, verification complexity, time-to-market, cost to manufacturer (Intel), cost to OEM (Dell), cost to end-customer (you) • Huge, multi-variable optimization problem! • Not all variables are independent • Not all variables have the same weight • The same variable may have different weights to different customers

  40. Goal of Processor Design • Slightly different for different segments • Laptops: maximize performance and battery life • Embedded: attain “sufficient” performance and then maximize battery life • Your MP3 player only needs to be fast enough to run the MP3 codec; any additional performance provides no end-user benefit and just costs more/consumes more power • Server: throughput vs. latency • In this course, we will be mostly focused on “high-performance” processors (desktop, server)
