Mikko H. Lipasti Associate Professor Department of Electrical and Computer Engineering University of Wisconsin—Madison Lazy Logic http://www.ece.wisc.edu/~pharm
Mikko Lipasti, University of Wisconsin Seminar--University of Toronto
CMOS History
• CMOS has been a faithful servant
  • 40+ years since invention
  • Tremendous advances
    • Device size, integration level
    • Voltage scaling
    • Yield, manufacturability, reliability
  • Nearly 20 years now as high-performance workhorse
• Result: life has been easy for architects
  • Ease leads to complacency & laziness
CMOS Futures
“The reports of my demise are greatly exaggerated.” – Mark Twain
• CMOS has some life left in it
  • Device scaling will continue
  • What comes after CMOS…
• Many new challenges
  • Process variability
  • Device reliability
  • Leakage power
  • Dynamic power (focus of this talk)
Dynamic Power
• Static CMOS: current flows when transistors switch
  • Combinational logic evaluates new inputs
  • Flip-flop, latch captures new value (clock edge)
• Terms
  • C: capacitance of circuit
    • wire length, number and size of transistors
  • V: supply voltage
  • A: activity factor
  • f: frequency
• Architects can/should focus on Ci x Ai
  • Reduce capacitance of each unit
  • Reduce activity of each unit
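The equation on this slide did not survive extraction. The terms listed match the standard first-order expression for CMOS dynamic power, given here as an assumption about what the slide showed; the per-unit index i is taken from the slide's "Ci x Ai".

```latex
P_{\text{dyn}} = \sum_{i} C_i \, V^2 \, A_i \, f
```

With V and f shared across units, the sum of C_i A_i is the part an architect can influence most directly.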
Design Objective Inversion
• Historically, hardware was expensive
  • Every gate, wire, cable, unit mattered
  • Squeeze maximum utilization from each
• Now, power is expensive
  • On-chip devices & wires, not so much
  • Should minimize Ci x Ai
  • Logic should be simple, infrequently used
  • Both sequential and combinational
• Lazy Logic
Talk Outline
• Motivation
• What is Lazy Logic?
• Applications of Lazy Logic
  • Circuit-switched coherence
  • Stall-cycle redistribution
  • Dynamic scheduling
• Conclusions
• Research Group Overview
What is Lazy Logic?
• Design philosophy
• Some overall principles
  • Minimize unit utilization
  • Minimize unit complexity
  • OK to increase number of units/wires/devices
    • As long as reduced Ai (activity) compensates
    • Don’t forget leakage
• Result
  • Reject conventional “good ideas”
  • Reduce power without loss of performance
  • Sometimes improve performance
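The trade-off in these principles (more units are acceptable if lower activity compensates) can be sketched numerically. This is a minimal illustration, not a model from the talk: the `Unit` class and all capacitance and activity values are hypothetical, and leakage is ignored.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    capacitance: float  # C_i in arbitrary units (hypothetical value)
    activity: float     # A_i, fraction of cycles in which the unit switches


def switched_capacitance(units):
    """Sum of C_i * A_i, the quantity the slides ask architects to minimize."""
    return sum(u.capacitance * u.activity for u in units)


# One heavily used shared unit versus four replicated units that rarely switch.
shared = [Unit(capacitance=1.0, activity=0.8)]
replicated = [Unit(capacitance=1.0, activity=0.15) for _ in range(4)]

print("shared:    ", switched_capacitance(shared))
print("replicated:", switched_capacitance(replicated))
```

In this made-up case the replicated design has four times the capacitance but still switches less in total. A fuller comparison would also charge the extra devices for leakage, as the slide warns.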
Lazy Logic Applications
• CMP interconnection networks
  • Old: Packet-switched, store-and-forward
  • New: Circuit-switched, reconfigurable
• Stall cycle redistribution
  • Transparent pipelines want fine-grained stalls
  • Redistribute coarse stalls into fine stalls
• High-performance dynamic scheduling
  • Cycle time goal achieved by replicating ALUs
CMP Interconnection Networks
• Options
  • Buses don’t scale
  • Crossbars are too expensive
  • Rings are too slow
  • Packet-switched mesh
• Attractive for all the DSM reasons
  • Scalable
  • Low latency
  • High link utilization
CMP Interconnection Networks
• But…
• Cables/traces are now on-chip wires
  • Fast, cheap, plentiful
  • Short: 1 cycle per hop
• Router latency adds up
  • 3-4 CPU cycles per hop
• Store-and-forward
  • Lots of activity/power
• Is this the right answer?
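The per-hop numbers on this slide can be turned into a back-of-the-envelope latency comparison. The sketch below assumes a purely additive model in which router delay is paid on top of the 1-cycle link at every hop; that additive model and the function names are assumptions, not something the slides state.

```python
def packet_switched_latency(hops, router_cycles=4, link_cycles=1):
    """Cycles to cross `hops` hops when each hop pays router plus link delay."""
    return hops * (router_cycles + link_cycles)


def wire_only_latency(hops, link_cycles=1):
    """Cycles to cross `hops` hops when only the link delay is paid."""
    return hops * link_cycles


for hops in (1, 3, 6):
    print(hops, packet_switched_latency(hops), wire_only_latency(hops))
```

Under these assumptions a 3-hop traversal costs 15 cycles through routers versus 3 cycles over wires alone, which is the gap the slide is pointing at.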
Circuit-switched Interconnects
• Communication patterns
  • Spatial locality to memory
  • Pairwise communication
• Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
Router Design
• Switches can be logically configured to appear as wires (no routing overhead)
• Can also act as packet-switched network
• Can switch back and forth very easily
• Detailed router design not presented here
[Figure: router with ports labeled P, N, S, E, W]
Dirty Miss coverage
Directory Protocol
• Initial 3-hop miss establishes CS path
• Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
• Predicted owner sources data early
• Directory acks update to sharing list
• Benefits
  • Reduced 3-hop latency
  • Less activity, less power
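The request flow described on this slide can be sketched as a small routing decision. This is an illustration under stated assumptions, not the protocol from the paper: the `MissPredictor` class, the per-region owner prediction, and the 4 KB region size are all hypothetical (the region idea is borrowed loosely from the "spatial locality to memory" point earlier in the talk).

```python
REGION_BITS = 12  # hypothetical: predict owners per 4 KB region


class MissPredictor:
    def __init__(self):
        # region number -> node that last supplied data for that region
        self.owner_by_region = {}

    def route_miss(self, addr, home_node):
        """Return the (channel, node) messages sent for a miss to `addr`."""
        owner = self.owner_by_region.get(addr >> REGION_BITS)
        if owner is None:
            # No prediction yet: conventional request through the home node.
            return [("home", home_node)]
        # Predicted owner is asked directly on the circuit-switched path,
        # with the home node contacted in parallel.
        return [("cs_path", owner), ("home", home_node)]

    def record_owner(self, addr, owner_node):
        """Remember who supplied the data once a miss completes."""
        self.owner_by_region[addr >> REGION_BITS] = owner_node


p = MissPredictor()
print(p.route_miss(0x1000, home_node=2))  # first miss goes to home only
p.record_owner(0x1000, owner_node=7)
print(p.route_miss(0x1234, home_node=2))  # later miss also uses the CS path
```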
Circuit-switched Performance
Link Activity
Buffer Activity
Circuit-switched Coherence Summary
• Reconfigurable interconnect
  • Circuit-switched links
  • Some performance benefit
  • Substantial reduction in activity
• Current status (slides are out of date)
  • Router design and physical/area models
  • Protocol tuning and tweaks, etc.
  • Initial results in CA Letters paper
Pipeline Clocking Revisited
Eric L. Hill – Preliminary Exam
• Conventional pipeline clock gating
  • Each valid work unit gets clocked into each latch
  • This is needlessly conservative
[Figure: work units A and B moving through the pipeline; latches clocked to propagate data. Two units of work, 10 clock pulses]
Transparent Pipeline Gating
• Transparent pipelining: novel approach to clocking [Jacobsen 2004, 2005]
  • Both master and slave latch can remain transparent
  • Gating logic ensures no races
  • Pipeline registers are clocked lazily only when race occurs
• Quite effective for low utilization pipelines
  • Gaps between valid work units enable transparent mode
[Figure: work units A and B moving through the pipeline. Two units of work, 5 clock pulses]
Applications
• Best suited for low utilization pipelines
  • E.g. FP, Media processing functional units
• High utilization pipelines see least benefit
  • E.g. Instruction fetch pipelines
• To benefit from transparent approach:
  • Valid data items need fine-grained gaps (stalls)
  • 1-cycle gap provides lion’s share (50%)
Application: Front-end Pipelines
• Provide back-end with sufficient supply of instructions to find ILP
  • High branch prediction accuracy
  • Low instruction cache miss rates
• Little opportunity for clock gating
  • Designed to feed peak demand
• Poor match for transparent pipeline gating
In-Order Execution Model
• In-order Cores
  • Power efficient
  • Low design complexity
• Throughput oriented CMP systems trending towards simple cores (e.g. Sun Niagara)
• Data dependences cause fine-grained stalls at dispatch
  • Can we project these back to fetch?
  • Exploit fetch slack time
Pipeline Diagram
[Figure: pipeline diagram with blocks labeled PC, Bpred, RP, clock vector, Instruction Fetch, Issue Buffer, and Execution Core, plus a bpred update path]
Available Fetch Slack
Implementation
• Stall cycle bits embedded in BTB
  • EPIC ISAs (IA64) could use stop bits
• Verify prediction by observing unperturbed groups
  • Let high confidence groups periodically execute unperturbed
  • Observe overall increase in execution time
• Modeled Cell PPU-like PowerPC core with aggressive clock gating
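The idea of turning one coarse fetch stall into several fine-grained ones can be sketched as a scheduling transformation. This is a simplified illustration only: the function names are hypothetical, and `predicted_stalls` is a stand-in for the stall cycle bits the slide places in the BTB.

```python
def coarse_schedule(groups, predicted_stalls):
    """Fetch all groups back to back, then idle for the accumulated stall."""
    return list(groups) + [None] * sum(predicted_stalls)


def redistributed_schedule(groups, predicted_stalls):
    """Insert each group's predicted stall cycles right after that group."""
    timeline = []
    for group, stall in zip(groups, predicted_stalls):
        timeline.append(group)
        timeline.extend([None] * stall)  # None marks a cycle with no fetch
    return timeline


groups = ["G0", "G1", "G2"]
stalls = [1, 0, 2]  # hypothetical per-group stall predictions
print(coarse_schedule(groups, stalls))
print(redistributed_schedule(groups, stalls))
```

Both schedules occupy the same number of cycles; only the placement of the idle cycles changes, which is what lets a transparent pipeline find gaps between valid fetch groups.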
Latch Activity Reduction
FE Energy Delay Product
Stall Cycle Redistribution Summary [ISLPED 2006]
• Transparent pipelines reduce latch activity
  • Not effective in pipelines with coarse-grained stalls (e.g. fetch)
• Coarse-grained stalls can be redistributed without affecting performance (fetch slack)
• Benefits
  • Equivalent performance, lower power
  • Transparent fetch pipeline now attractive
A Brief Scheduler Overview
[Figure: pipeline timing diagrams with stages Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit. One variant shows an atomic Sched/Exe stage with wakeup/select; the other shows speculatively issued instructions with speculative wakeup/select and a Writeback/Recover stage. Annotations: "Latency Changed!!", "Invalid input value", "Re-schedule when latency mispredicted"]
• Data capture/non-data capture scheduler
• Speculative scheduling
• Data capture scheduler desirable for many reasons
  • Cycle time is not competitive because of data path delay
• Current machines use speculative scheduling
  • Misscheduled/replayed instructions burn power
  • Depending on recovery policy, up to 17% issued insts need to replay
Slicing the Core
• Bitslice the core: narrow (16b) and wide (64b)
• Narrow core can be full data capture
  • Still makes aggressive cycle time (with lazy logic)
  • Completely nonspeculative, virtually no replays
  • Further power benefits (not in this talk)
[Figure: blocks labeled Front-End, OoO Core, Back-End]
Dynamic Scheduling with Partial Operand Values
[Figure: pipeline timing diagram with stages Fetch, Decode, Sched & Nrw Exe, Dispatch, RF, Exe, Writeback/Recover, Commit. Annotations: "wakeup/select", "the rest of the data"]
• Narrow core
  • Computes partial operand
  • Determines load latency
  • Avoids misscheduling
• Wide core
  • Computes the rest of the operand (if needed)
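Why a narrow core can compute a useful partial operand follows from how addition works: carries only propagate upward, so the low 16 bits of a sum depend only on the low 16 bits of the inputs. The sketch below checks that property and shows one plausible use, deriving a cache set index from the narrow result. The cache geometry (64-byte lines, 128 sets) and the link to load latency are assumptions for illustration; the slides do not say how load latency is determined.

```python
NARROW_BITS = 16
NARROW_MASK = (1 << NARROW_BITS) - 1
WIDE_MASK = (1 << 64) - 1

LINE_BITS = 6   # hypothetical: 64-byte cache lines
INDEX_BITS = 7  # hypothetical: 128 sets, so index bits fit in the low 16 bits


def narrow_add(a, b):
    """Add only the low 16 bits of each operand, as a narrow ALU would."""
    return ((a & NARROW_MASK) + (b & NARROW_MASK)) & NARROW_MASK


def wide_add(a, b):
    """Full 64-bit addition with wraparound."""
    return (a + b) & WIDE_MASK


def cache_set(addr):
    """Set index taken from the address bits just above the line offset."""
    return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)


base, offset = 0x7FFF1234ABC0, 0x5678
partial = narrow_add(base, offset)
full = wide_add(base, offset)

print(hex(partial), hex(full & NARROW_MASK))
print(cache_set(partial) == cache_set(full))
```

Because the set index lies entirely within the low 16 bits in this geometry, the narrow result is enough to start a cache lookup before the wide sum is available.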
Scheduler w/ Narrow Data-Path
• Non-data capture scheduler
  • Select – mux – tag bcast & compare – ready wr
• Naïve narrow data capture scheduler (increased cycle time)
  • Select – mux – tag bcast & compare – ready wr
  • Select – mux – narrow ALU – data bcast – data wr
Scheduler w/ Embedded ALUs
• With embedded ALUs
  • Select – mux – tag bcast & compare – ready wr
  • Max(select, data bcast – mux – narrow ALU) – mux – latch setup
• Lazy Logic
  • Replicated ALUs
  • Low utilization
  • Off critical delay path
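The delay paths listed for the naive and embedded-ALU schedulers can be compared with a small calculation. Every delay value below is hypothetical and chosen only to make the structure visible; the path definitions follow the stage lists on the slides.

```python
# Hypothetical component delays in ns; none of these values come from the talk.
D = {
    "select": 0.50, "mux": 0.10, "tag_bcast_cmp": 0.45, "ready_wr": 0.15,
    "narrow_alu": 0.30, "data_bcast": 0.25, "data_wr": 0.15, "latch_setup": 0.05,
}


def tag_path(d):
    """Select, mux, tag broadcast and compare, ready write."""
    return d["select"] + d["mux"] + d["tag_bcast_cmp"] + d["ready_wr"]


def naive_data_path(d):
    """Select, mux, narrow ALU, data broadcast, data write, all in series."""
    return d["select"] + d["mux"] + d["narrow_alu"] + d["data_bcast"] + d["data_wr"]


def embedded_data_path(d):
    """Replicated ALUs work in parallel with select; then mux and latch setup."""
    parallel = max(d["select"], d["data_bcast"] + d["mux"] + d["narrow_alu"])
    return parallel + d["mux"] + d["latch_setup"]


naive_cycle = max(tag_path(D), naive_data_path(D))
embedded_cycle = max(tag_path(D), embedded_data_path(D))
print(f"naive: {naive_cycle:.2f} ns, embedded ALUs: {embedded_cycle:.2f} ns")
```

With these made-up numbers the serial data path sets the cycle time in the naive design, while in the embedded design the tag path does, so the ALU no longer sits on the critical path.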
Cycle Time, Area, Energy
• 32 entries, implemented using Verilog
• Synthesized using Synopsys Design Compiler and LSI Logic’s gflxp 0.11um

                               Cycle Time (ns)   Area (mm2)   Energy (nJ)
Full-Data Capture                   2.04            1.98         1.40
Narrow-Data Capture                 1.71            1.49         1.46
Narrow-Data Capture w/ ALUs         1.28            1.53         1.48
Non-Data Capture                    1.28            1.43         1.54
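The synthesis numbers on this slide are easier to compare as ratios. The sketch below normalizes each scheduler to the non-data capture design; the figures are copied from the slide, while the `relative_to` helper is a hypothetical name introduced here.

```python
# (cycle time ns, area mm2, energy nJ), as reported on the slide
results = {
    "Full-Data Capture": (2.04, 1.98, 1.40),
    "Narrow-Data Capture": (1.71, 1.49, 1.46),
    "Narrow-Data Capture w/ ALUs": (1.28, 1.53, 1.48),
    "Non-Data Capture": (1.28, 1.43, 1.54),
}


def relative_to(table, baseline_name):
    """Divide every metric by the corresponding metric of the baseline row."""
    base = table[baseline_name]
    return {
        name: tuple(value / ref for value, ref in zip(row, base))
        for name, row in table.items()
    }


rel = relative_to(results, "Non-Data Capture")
for name, (cycle, area, energy) in rel.items():
    print(f"{name:28s} cycle x{cycle:.2f}  area x{area:.2f}  energy x{energy:.2f}")
```

The ratios make the slide's point directly: the narrow data capture scheduler with embedded ALUs matches the non-data capture cycle time, while the naive narrow and full data capture designs are slower.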
Dynamic Scheduling Summary
• Benefits: [JILP 2007]
  • Save 25-30% of total OoO window energy => 12-18% total dynamic chip power
  • Reduce misspeculated loads by 75%-80%
  • Slightly improved IPC
  • Comparable cycle time
• Enabled by:
  • Lazy narrow ALUs
  • ALUs are cheap, so compute in parallel with scheduling select logic
Conclusions
• Lazy Logic
  • Promising new design philosophy
• Some overall principles
  • Minimize unit utilization
  • Minimize unit complexity
  • OK to increase number of units/wires/devices
• Initial Results
  • Circuit-switched CMP interconnects
  • Stall cycle redistribution
  • Dynamic Scheduling
Who Are We?
• Faculty: Mikko Lipasti
• Current Ph.D. students:
  • Profligate execution: Gordie Bell (joining IBM in 2006)
  • Coarse-grained coherence: Jason Cantin (joining IBM in 2006)
  • Lazy Logic
    • Circuit-switched coherence: Natalie Enright
    • Stall cycle redistribution: Eric Hill
    • Dynamic scheduling: Erika Gunadi
  • Dynamic code optimization: Lixin Su
  • SMT/CMP scheduling/resource allocation: Dana Vantrease
• Pharmed out:
  • IBM: Trey Cain, Brian Mestan
  • AMD: Kevin Lepak
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Research Group Overview
• Faculty: Mikko Lipasti, since 1999
• Current MS/PhD students
  • Gordie Bell, Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
• Graduates, current employment:
  • AMD: Kevin Lepak
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka
Current Focus Areas
• Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
• Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
• Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization
Funding
• IBM
  • Faculty Partnership Awards
  • Shared University Research equipment
• Intel
  • Research council support
  • Equipment donations
• National Science Foundation
  • CSA, ITR, NGS, CPA
  • Career Award
• Schneider ECE Faculty Fellowship
• UW Graduate School
Questions?
http://www.ece.wisc.edu/~pharm
Backup slides
Technology Parameters
• 65 nm technology generation
• 16 tiled processors
  • Approximately 4 mm x 4 mm
• Signal can travel approximately 4 mm/cycle
• Circuit switched interconnect consists of 5 mm unidirectional links
Broadcast Protocol
• Broadcast to all nodes
• Establish Circuit-Switched path with owner of data
• Future broadcasts will use Circuit-Switched path to reduce power
  • Predict when CS path will suffice
• Use LRU information for paths to tear down old paths when resources need to be claimed by new path
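The LRU teardown policy mentioned on this slide can be sketched with an ordered map. This is a minimal illustration, assuming a fixed number of path slots per node; the `PathTable` class, its capacity, and the route representation are hypothetical and not taken from the talk.

```python
from collections import OrderedDict


class PathTable:
    """Tracks established circuit-switched paths in least-recently-used order."""

    def __init__(self, capacity):
        self.capacity = capacity      # hypothetical limit on live paths
        self.paths = OrderedDict()    # destination node -> route description

    def use(self, dest):
        """Mark a path as recently used; return False if no path exists."""
        if dest not in self.paths:
            return False
        self.paths.move_to_end(dest)
        return True

    def establish(self, dest, route):
        """Add a path, tearing down the least recently used one if full.

        Returns the destination whose path was torn down, or None.
        """
        victim = None
        if dest not in self.paths and len(self.paths) >= self.capacity:
            victim, _ = self.paths.popitem(last=False)
        self.paths[dest] = route
        self.paths.move_to_end(dest)
        return victim


table = PathTable(capacity=2)
table.establish(1, ["E"])
table.establish(2, ["N"])
table.use(1)                           # path to node 1 is now most recent
print(table.establish(3, ["S", "S"]))  # tears down the path to node 2
```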
Switch Design (from paper)
[Figure: switch with Processor, N, E, S, and W ports and a Buffer, each marked CM; CM = Configuration Memory]