530 likes | 662 Views
ELEC 669 Low Power Design Techniques Lecture 1. Amirali Baniasadi amirali@ece.uvic.ca. ELEC 669: Low Power Design Techniques. Instructor: Amirali Baniasadi EOW 441, Only by appt. Call or email with your schedule.
E N D
ELEC 669Low Power Design TechniquesLecture 1 Amirali Baniasadi amirali@ece.uvic.ca
ELEC 669: Low Power Design Techniques Instructor: Amirali Baniasadi EOW 441, Only by appt. Call or email with your schedule. Email: amirali@ece.uvic.ca Office Tel: 721-8613 Web Page for this class will be at http://www.ece.uvic.ca/~amirali/courses/ELEC669/elec669.html Will use paper reprints Lecture notes will be posted on the course web page.
CourseStructure • Lectures: • 1-2 weeks on processor review • 5 weeks on low power techniques • 6 weeks: discussion, presentation, meetings • Reading paper posted on the web for each week. • Need to bring a 1 page review of the papers. • Presentations: Each student should give to presentations in class.
Course Philosophy • Papers to be used as supplement for lectures (If a topic is not covered in the class, or a detail not presented in the class, that means I expect you to read on your own to learn those details) • One Project (50%) • Presentation (30%)- Will be announced in advance. • Final Exam: take home (20%) • IMPORTANT NOTE: Must get passing grade in all components to pass the course. Failing any of the three components will result in failing the course.
Project • More on project later
Topics • High Performance Processors? • Low-Power Design • Low Power Branch Prediction • Low-Power Register Renaming • Low-Power SRAMs • Low-Power Front-End • Low-Power Back-End • Low-Power Issue Logic • Low-Power Commit • AND more…
Fetch Decode Issue Complete Commit A Modern Processor 1-What do each do? 2-Possible Power Optimizations? Front-end Back-end
Power Breakdown Alpha21464 PentiumPro
Instruction Set Architecture (ISA) Fetch Instruction From Memory DecodeInstruction determine its size & action Fetch Operand data Execute instruction & compute results or status Store Result in memory Determine Next Instruction’s address • Instruction Execution Cycle
What Should we Know? • A specific ISA (MIPS) • Performance issues - vocabulary and motivation • Instruction-Level Parallelism • How to Use Pipelining to improve performance • Exploiting Instruction-Level Parallelism w/ Dynamic Approach • Memory: caches and virtual memory
What is Expected From You? • Read papers! • Be up-to-date! • Come back with your input & questions for discussion!
Power? • Everything is done by tiny switches • Their charge represents logic values • Changing charge energy • Power energy over time • Devices are non-ideal power heat • Excess heat Circuits breakdown Need to keep power within acceptable limits
Power as a Performance Limiter Conventional Performance Scaling: Goal: Max. performance w/ min cost/complexity How: -More and faster xtors. -More complex structures. Power: Don’t fix if it ain’t broken Not True Anymore: Power has increased rapidly Power-Aware Architecture a Necessity
Power-Aware Architecture Conventional Architecture: Goal: Max. performance How: Do as muchas you can. This WorkPower-Aware Architecture Goal: Min. Power and Maintain Performance How: Do as little as you can, while maintaining performance Challenging and new area
Why is this challenging • Identify actions that can be delayed/eliminated • Don’t touch those that boost performance • Cost/Power of doing so must not out-weight benefits
Definitions • Performance is in units of things-per-second • bigger is better • If we are primarily concerned with response time • performance(x) = 1 execution_time(x) " X is n times faster than Y" means Performance(X) n = ---------------------- Performance(Y)
Amdahl's Law Speedup due to enhancement E: ExTime w/o E Performance w/ E Speedup(E) = -------------------- = --------------------- ExTime w/ E Performance w/o E Suppose that enhancement E accelerates a fraction F of the task by a factor S and the remainder of the task is unaffected then, ExTime(with E) = ((1-F) + F/S) X ExTime(without E) Speedup(with E) = ExTime(without E) ÷ ((1-F) + F/S) X ExTime(without E) Speedup(with E) =1/ ((1-F) + F/S)
Amdahl's Law-example A new CPU makes Web serving 10 times faster. The old CPU spent 40% of the time on computation and 60% on waiting for I/O. What is the overall enhancement? Fraction enhanced= 0.4 Speedup enhanced = 10 Speedup overall = 1 = 1.56 0.6 +0.4/10
Why Do Benchmarks? • How we evaluate differences • Different systems • Changes to a single system • Provide a target • Benchmarks should represent large class of important programs • Improving benchmark performance should help many programs • For better or worse, benchmarks shape a field • Good ones accelerate progress • good target for development • Bad benchmarks hurt progress • help real programs v. sell machines/papers? • Inventions that help real programs don’t help benchmark
SPEC first round • First round 1989; 10 programs, single number to summarize performance • One program: 99% of time in single line of code • New front-end compiler could improve dramatically
SPEC95 • Eighteen application benchmarks (with inputs) reflecting a technical computing workload • Eight integer • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex • Ten floating-point intensive • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fppp, wave5 • Must run with standard compiler flags • eliminate special undocumented incantations that may not even generate working code for real programs
CPU time = Seconds = Instructions x Cycles x Seconds Program Program Instruction Cycle Summary • Time is the measure of computer performance! • Remember Amdahl’s Law: Improvement is limited by unimproved part of program
Execution Cycle Obtain instruction from program storage Instruction Fetch Determine required actions and instruction size Instruction Decode Locate and obtain operand data Operand Fetch Compute result value or status Execute Deposit results in storage for later use Result Store Determine successor instruction Next Instruction
What Must be Specified? Instruction Fetch ° Instruction Format or Encoding – how is it decoded? ° Location of operands and result – where other than memory? – how many explicit operands? – how are memory operands located? – which can or cannot be in memory? ° Data type and Size ° Operations – what are supported ° Successor instruction – jumps, conditions, branches Instruction Decode Operand Fetch Execute Result Store Next Instruction
What Is an ILP? • Principle: Many instructions in the code do not depend on each other • Result: Possible to execute them in parallel • ILP: Potential overlap among instructions (so they can be evaluated in parallel) • Issues: • Building compilers to analyze the code • Building special/smarter hardware to handle the code • ILP: Increase the amount of parallelism exploited among instructions • Seeks Good Results out of Pipelining
What Is ILP? • CODE A: CODE B: • LD R1, (R2)100 LD R1,(R2)100 • ADD R4, R1 ADD R4,R1 • SUB R5,R1 SUB R5,R4 • CMP R1,R2 SW R5,(R2)100 • ADD R3,R1 LD R1,(R2)100 • Code A: Possible to execute 4 instructions in parallel. • Code B: Can’t execute more than one instruction per cycle. Code A has Higher ILP
In-Order A B C A: LD R1, (R2) B: ADD R3, R4 C: ADD R3, R5 D: CMP R3, R1 D Out of Order Execution Programmer: Instructions execute in-order Processor: Instructions may execute in any order if results remain the same at the end Out-of-Order B: ADD R3, R4 C: ADD R3, R5 A: LD R1, (R2) D: CMP R3, R1
Source instruction • Dependant instruction • Latency (clock cycles) • FP ALU op • Another FP ALU op • 3 • FP ALU op • Store double • 2 • Load double • FP ALU op • 1 • Load double • Store double • 0 Assumptions • Five-stage integer pipeline • Branches have delay of one clock cycle • ID stage: Comparisons done, decisions made and PC loaded • No structural hazards • Functional units are fully pipelined or replicated (as many times as the pipeline depth) • FP Latencies Integer load latency: 1; Integer ALU operation latency: 0
Simple Loop & Assembler Equivalent • for (i=1000; i>0; i--) x[i] = x[i] + s; • Loop: LD F0, 0(R1) ;F0=array element • ADDD F4, F0, F2 ;add scalar in F2 • SD F4 , 0(R1) ;store result • SUBI R1, R1, #8 ;decrement pointer 8bytes (DW) • BNE R1, R2, Loop ;branch R1!=R2 • x[i] & s are double/floating point type • R1 initially address of array element with the highest address • F2 contains the scalar value s • Register R2 is pre-computed so that 8(R2) is the last element to operate on
Unscheduled Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) SUBI R1, R1, #8 stall BNE R1, R2, Loop stall 10 clock cycles Can we minimize? Scheduled Loop: LD F0, 0(R1) SUBI R1, R1, #8 ADDD F4, F0, F2 stall BNE R1, R2, Loop SD F4, 8(R1) 6 clock cycles 3 cycles: actual work; 3 cycles: overhead Can we minimize further? Schedule • Source instruction • Dependant instruction • Latency (clock cycles) • FP ALU op • Another FP ALU op • 3 • FP ALU op • Store double • 2 • Load double • FP ALU op • 1 • Load double • Store double • 0 Where are the stalls?
LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD F4, 0(R1) LD F6, -8(R1) ADDD F8, F6, F2 SD F8, -8(R1) LD F10, -16(R1) ADDD F12, F10, F2 SD F12, -16(R1) LD F14, -24(R1) ADDD F16, F14, F2 SD F16, -24(R1) SUBI R1, R1, #32 BNE R1, R2, Loop Eliminate Incr, Branch LD F0, 0(R1) ADDD F4, F0, F2 SD F4 , 0(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -8(R1) ADDD F4, F0, F2 SD F4 , -8(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -16(R1) ADDD F4, F0, F2 SD F4 , -16(R1) SUBI R1, R1, #8 BNE R1, R2, Loop LD F0, -24(R1) ADDD F4, F0, F2 SD F4 , -24(R1) SUBI R1, R1, #32 BNE R1, R2, Loop Loop Unrolling Four copies of loop Four iteration code Assumption: R1 is initially a multiple of 32 or number of loop iterations is a multiple of 4
Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) LD F6, -8(R1) stall ADDD F8, F6, F2 stall stall SD F8, -8(R1) LD F10, -16(R1) stall ADDD F12, F10, F2 stall stall SD F12, -16(R1) LD F14, -24(R1) stall ADDD F16, F14, F2 stall stall SD F16, -24(R1) SUBI R1, R1, #32 stall BNE R1, R2, Loop stall Loop Unroll & Schedule Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12, -16(R1) SUBI R1, R1, #32 BNE R1, R2, Loop SD F16, 8(R1) Schedule No stalls! 14 clock cycles or 3.5 per iteration Can we minimize further? 28 clock cycles or 7 per iteration Can we minimize further?
Summary Iteration 10 cycles Unrolling 7 cycles Scheduling Scheduling 6 cycles 3.5 cycles (No stalls)
Multiple Issue • Multiple Issue is the ability of the processor to start more than one instruction in a given cycle. • Superscalar processors • Very Long Instruction Word (VLIW) processors
Fetch Decode Issue Complete Commit A Modern Processor Multiple Issue Front-end Back-end
1990’s: Superscalar Processors • Bottleneck: CPI >= 1 • Limit on scalar performance (single instruction issue) • Hazards • Superpipelining? Diminishing returns (hazards + overhead) • How can we make the CPI = 0.5? • Multiple instructions in every pipeline stage (super-scalar) • 1 2 3 4 5 6 7 • Inst0 IF ID EX MEM WB • Inst1 IF ID EX MEM WB • Inst2 IF ID EX MEM WB • Inst3 IF ID EX MEM WB • Inst4 IF ID EX MEM WB • Inst5 IF ID EX MEM WB
Elements of Advanced Superscalars • High performance instruction fetching • Good dynamic branch and jump prediction • Multiple instructions per cycle, multiple branches per cycle? • Scheduling and hazard elimination • Dynamic scheduling • Not necessarily: Alpha 21064 & Pentium were statically scheduled • Register renaming to eliminate WAR and WAW • Parallel functional units, paths/buses/multiple register ports • High performance memory systems • Speculative execution
SS + DS + Speculation • Superscalar + Dynamic scheduling + Speculation Three great tastes that taste great together • CPI >= 1? • Overcome with superscalar • Superscalar increases hazards • Overcome with dynamic scheduling • RAW dependences still a problem? • Overcome with a large window • Branches a problem for filling large window? • Overcome with speculation
The Big Picture issue Static program Fetch & branch predict execution & Reorder & commit
Superscalar Microarchitecture Floating point register file Functional units Memory interface Floating point inst. buffer Inst. Cache Pre-decode Inst. buffer Decode rename dispatch Functional units and data cache Integer address inst buffer Integer register file Reorder and commit
Register renaming methods • First Method: • Physical register file vs. logical (architectural) register file. • Mapping table used to associate physical reg w/ current value of log. Reg • use a free list of physical registers • Physical register file bigger than log register file • Second Method: • physical register file same size as logical • Also, use a buffer w/ one entry per inst. Reorder buffer.
Loop: LD F0, 0(R1) stall ADDD F4, F0, F2 stall stall SD F4, 0(R1) LD F6, -8(R1) stall ADDD F8, F6, F2 stall stall SD F8, -8(R1) LD F10, -16(R1) stall ADDD F12, F10, F2 stall stall SD F12, -16(R1) LD F14, -24(R1) stall ADDD F16, F14, F2 stall stall SD F16, -24(R1) SUBI R1, R1, #32 stall BNE R1, R2, Loop stall Register Renaming Example Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12, -16(R1) SUBI R1, R1, #32 BNE R1, R2, Loop SD F16, 8(R1) Schedule No stalls! 14 clock cycles or 3.5 per iteration Can we minimize further? 28 clock cycles or 7 per iteration Can we minimize further?
r0 r1 r2 r3 r4 R8 r0 r1 r2 r3 r4 R8 R7 R7 R5 R5 R1 R2 R9 R9 R2 R6 R13 R6 R13 Register renaming: first method Mapping table Mapping table Add r3,r3,4 Free List Free List
Superscalar Processors • Issues varying number of instructions per clock • Scheduling: Static (by the compiler) or dynamic(by the hardware) • Superscalar has a varying number of instructions/cycle (1 to 8), scheduled by compiler or by HW (Tomasulo). • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
More Realistic HW: Register Impact • Effect of limiting the number of renaming registers FP: 11 - 45 Integer: 5 - 15 IPC
Reorder Buffer • Place data in entry when execution finished Reserve entry at tail when dispatched Remove from head when complete Bypass to other instructions when needed
R8 R7 R5 rob6 R9 register renaming:reorder buffer Before add r3,r3,4 Add r3, rob6, 4 add rob8,rob6,4 r0 r1 r2 r3 r4 r0 r1 r2 r3 r4 R8 R7 R5 rob8 R9 8 7 6 0 7 6 0 R3 0 R3 …. r3 …..….. Reorder buffer Reorder buffer
Instruction Buffers Floating point register file Functional units Memory interface Floating point inst. buffer Inst. Cache Pre-decode Inst. buffer Decode rename dispatch Functional units and data cache Integer address inst buffer Integer register file Reorder and commit
Issue Buffer Organization • a) Single, shared queue b)Multiple queue; one per inst. type No out-of-order No Renaming No out-of-order inside queues Queues issue out of order