Computer Architecture: Advanced Topics
Performance per Watt
• Mobile: a smaller form factor decreases the power budget
• Power generates heat, which must be dissipated to keep transistors within their allowed temperature
  • This limits the processor's peak power consumption
• Change the target
  • Old target: get maximum performance
  • New target: get maximum performance within a given power envelope – performance per Watt
• Performance via frequency increase
  • Power = C·V²·f, but increasing f also requires increasing V
  • So X% more performance costs roughly 3X% more power (assuming performance scales linearly with frequency)
• A power-efficient feature is one that does better than this 1:3 performance:power ratio
  • Otherwise it is better to just increase frequency (and voltage)
  • Micro-architectural performance features should be power efficient
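As a quick sanity check of the 1:3 ratio, here is a minimal sketch of the P = C·V²·f model in C, assuming voltage scales linearly with frequency (the constants are normalized illustrations, not measured values):

```c
#include <stdio.h>

/* First-order dynamic power model: P = C * V^2 * f.
 * Assuming voltage scales linearly with frequency, a small
 * frequency increase df/f costs roughly 3*df/f in power:
 * dP/P = 2*dV/V + df/f ~ 3*df/f. */
int main(void) {
    double C = 1.0;            /* normalized switched capacitance */
    double f = 1.0, V = 1.0;   /* baseline frequency and voltage  */
    double base = C * V * V * f;

    double scale = 1.10;       /* +10% frequency (and voltage)    */
    double boosted = C * (V * scale) * (V * scale) * (f * scale);

    printf("+10%% frequency -> %+.1f%% power\n",
           (boosted / base - 1.0) * 100.0);   /* about +33% */
    return 0;
}
```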
Higher Performance vs. Longer Battery Life

Typical platform average power breakdown: Display (panel + inverter) 33%, CPU 10%, Power Supply 10%, Intel® MCH 9%, HDD 8%, GFX 8%, Misc. 8%, CLK 5%, Intel® ICH 3%, Fan 2%, LAN 2%, DVD 2%

• Processor average power is <10% of the platform
  • The processor reduces power in periods of low activity
  • The processor enters lower-power states in idle periods
  • Average power includes low-activity periods and idle time
  • Typical: 1W – 3W
• Max power is limited by heat dissipation
  • Typical: 20W – 100W
• Decision
  • Optimize for performance when active
  • Optimize for battery life when idle
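To make the average-power numbers concrete, a rough battery-life estimate; the 50 Wh battery capacity and the exact 10% CPU share are illustrative assumptions, not figures from the slide:

```c
#include <stdio.h>

/* Rough battery-life estimate from average platform power.
 * The 50 Wh capacity and the exact 10% CPU share are
 * illustrative assumptions. */
int main(void) {
    double battery_wh = 50.0;   /* hypothetical battery capacity   */
    double cpu_avg_w  = 2.0;    /* typical 1-3 W CPU average power */
    double platform_w = cpu_avg_w / 0.10;  /* CPU ~10% of platform */

    printf("Estimated battery life: %.1f hours\n",
           battery_wh / platform_w);        /* 50 / 20 = 2.5 h */
    return 0;
}
```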
Leakage Power
• The power consumed by a processor consists of
  • Active power: used to switch transistors
  • Leakage power: leakage of transistors under voltage
• Leakage power is a function of
  • The number of transistors and their size
  • Operating voltage
  • Die temperature
• Leakage power reduction
  • The LLC (Last Level Cache) is built with low-leakage transistors (2/3 of the die transistors)
  • Low-leakage transistors are slower, increasing cache access latency
  • The significant power saved justifies the small performance loss
• Enhanced Intel SpeedStep® Technology
  • Reduces voltage and frequency during low processor activity
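A first-order illustration of why voltage and temperature matter for leakage; the model shape and every coefficient below are made up for illustration and are not a process model:

```c
#include <stdio.h>
#include <math.h>

/* Illustrative leakage model: leakage grows with transistor
 * count, voltage, and temperature. All coefficients are
 * hypothetical. */
static double leakage_w(double n_transistors, double v, double temp_c) {
    double k = 1e-9;                    /* hypothetical scale factor */
    return k * n_transistors * v * exp((temp_c - 25.0) / 30.0);
}

int main(void) {
    double n = 1e9, v = 1.0;            /* ~1B transistors, 1.0 V   */
    printf("leakage at 90C vs 50C: %.2f W vs %.2f W\n",
           leakage_w(n, v, 90.0), leakage_w(n, v, 50.0));
    return 0;
}
```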
Enhanced Intel SpeedStep® Technology
• The basic SpeedStep® technology had
  • 2 operating points
  • A non-transparent switch
• The enhanced version provides
  • Multiple voltage/frequency operating points
  • For example, the 1.6GHz Pentium M processor operates over a range
    • From 600MHz @ 0.956V
    • To 1.6GHz @ 1.484V
  • Transparent switches
  • Frequent switches
• Benefits
  • Higher power efficiency: running at 2.7X lower frequency cuts power ~6.1X, so a ~2X performance loss buys a >2X energy gain (efficiency ratio ≈ 2.3)
  • Outstanding battery life
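These ratios can be checked against the P = C·V²·f model using the two operating points above; the simple model gives ~6.4X rather than the measured 6.1X, close enough to show where the numbers come from:

```c
#include <stdio.h>

/* Relative power P ~ V^2 * f at the two Pentium M operating
 * points quoted above (normalized capacitance). */
int main(void) {
    double p_hi = 1.484 * 1.484 * 1.6;   /* 1.6 GHz @ 1.484 V */
    double p_lo = 0.956 * 0.956 * 0.6;   /* 600 MHz @ 0.956 V */

    printf("frequency ratio: %.1fX\n", 1.6 / 0.6);    /* ~2.7X */
    printf("power ratio:     %.1fX\n", p_hi / p_lo);  /* ~6.4X */
    /* Even at a ~2X slowdown, energy per task (power * time)
     * drops by power_ratio / slowdown, i.e. well over 2X. */
    printf("energy gain at 2X slowdown: %.1fX\n",
           (p_hi / p_lo) / 2.0);
    return 0;
}
```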
2nd Gen Intel® Core™ Microarchitecture: Overview
• Integrates CPU, Graphics, Memory Controller, and PCI Express* on a single chip
  • High-bandwidth / low-latency modular core/GFX interconnect
• Next-generation Intel® Turbo Boost Technology
  • Substantial performance improvement
• High-bandwidth Last Level Cache, shared by cores and graphics
• Intel® Advanced Vector Extensions (Intel® AVX)
• Next-generation processor graphics and media
• Integrated Memory Controller: 2-channel DDR3
• Intel® Hyper-Threading Technology: 4 cores / 8 threads or 2 cores / 4 threads
• Discrete graphics support: 1x16 or 2x8 PCI Express*
• System Agent: display (notebook DP port, embedded DisplayPort), x16 PCIe, DMI to the PCH, PECI interface to the embedded controller
Core Block Diagram
[Block diagram: the Front End (branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, 1.5K-uop cache) translates IA instructions into uops and feeds the in-order allocate/rename/retire stage (reorder buffers, load/store buffers, zeroing idioms). The out-of-order scheduler dispatches uops to six execution ports: Port 0 (ALU, SIMD MUL, DIV, FP MUL), Port 1 (ALU, SIMD ALU, FP ADD), Port 5 (ALU, branch, FP shuffle), Ports 2 and 3 (load / store address), and Port 4 (store data). The memory side comprises the data cache unit with a 32KB L1 data cache, fill buffers, and the L2 cache (MLC) at 48 bytes/cycle.]
Front End: Instruction Fetch and Decode
• 32KB 8-way associative instruction cache
• 4 decoders, up to 4 instructions / cycle
• Micro-fusion
  • Bundles the operations of a single instruction into one "uop"
• Macro-fusion
  • Fuses instruction pairs into a single complex "uop" (e.g., a compare followed by a conditional jump becomes one compare-and-branch uop)
• The decode pipeline supports 16 bytes per cycle
Decoded Uop Cache
• An instruction cache that holds uops instead of instruction bytes
  • ~1.5K uops; ~80% hit rate for most applications
• Higher instruction bandwidth and lower latency
  • The decoded uop cache can deliver 32 bytes / cycle
  • More cycles sustaining 4 instructions/cycle
  • Able to 'stitch' across taken branches in the control flow
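A blended-average model of the resulting fetch bandwidth, using the ~80% hit rate quoted above; the weighting is an illustration, not a published figure:

```c
#include <stdio.h>

/* Effective front-end fetch bandwidth with the decoded uop
 * cache, blending its 32 B/cycle with the legacy decoder's
 * 16 B/cycle at the quoted ~80% hit rate. */
int main(void) {
    double hit          = 0.80;
    double uop_cache_bw = 32.0;  /* bytes/cycle from the uop cache */
    double decoder_bw   = 16.0;  /* bytes/cycle from legacy decode */
    printf("effective fetch bandwidth: %.1f bytes/cycle\n",
           hit * uop_cache_bw + (1.0 - hit) * decoder_bw); /* 28.8 */
    return 0;
}
```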
New Branch Predictor
• Twice as many targets
• Much more effective storage for history
• Much longer history for data-dependent behaviors
Front End: Power
• The decoded uop cache lets the normal front end sleep
  • Decode once instead of many times
• Branch mispredictions are reduced substantially
  • The correct path is also the most efficient path
Save power while increasing performance
The "Out of Order" Part of the Machine
• Receives uops from the Front End
• Sends them to the execution units when they are ready
• Retires them in program order
• Increases performance by finding more instruction-level parallelism
• Increasing the depth and width of the machine implies larger buffers
  • More data storage, more data movement, more power
Sandy Bridge Out-of-Order (OOO) Cluster
• Method: a Physical Register File (PRF) instead of a centralized retirement register file
  • A single copy of every value
  • No data movement after calculation
• Allows a significant increase in buffer sizes
  • Dataflow window ~33% larger
• The PRF gives better-than-linear performance/power – a key enabler for Intel® AVX
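A minimal sketch of the PRF idea for a toy machine with a handful of architectural registers; the structure sizes and the trivial free-list policy are invented for illustration. The point is that rename and retire only update mapping tables, so a result is written once and never moved:

```c
#include <stdio.h>

#define ARCH_REGS 4
#define PHYS_REGS 16

/* Toy physical-register-file renaming: values live in one place
 * (the PRF); rename and retirement only update mapping tables. */
static long prf[PHYS_REGS];          /* the single copy of data  */
static int  rat[ARCH_REGS];          /* speculative arch->phys   */
static int  next_free = ARCH_REGS;   /* trivial free-list cursor */

/* Rename the destination of an instruction writing arch reg 'r':
 * allocate a fresh physical register; no value is copied. */
static int rename_dst(int r) {
    int p = next_free++ % PHYS_REGS;
    rat[r] = p;
    return p;
}

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++) rat[r] = r;

    int p = rename_dst(1);    /* e.g., an instruction writing r1 */
    prf[p] = 5;               /* execution writes the PRF once   */
    /* Retirement commits by making the mapping architectural;
     * a retirement register file would copy prf[p] instead.     */
    printf("arch r1 -> phys p%d, value %ld\n", rat[1], prf[rat[1]]);
    return 0;
}
```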
Intel® Advanced Vector Extensions
• Vectors are a natural data type for many applications
• Extends the SSE FP instruction set to a 256-bit operand size
  • Intel AVX extends all 16 XMM registers to 256-bit YMM registers (XMM0 is the low 128 bits of YMM0)
• New, non-destructive source syntax
  • VADDPS ymm1, ymm2, ymm3
• New operations to enhance vectorization
  • Broadcasts
  • Masked load & store
Wider vectors and non-destructive sources specify more work with fewer instructions; extending the existing state is area- and power-efficient
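A short C intrinsics sketch of the features named above (three-operand add, broadcast, masked load); requires an AVX-capable compiler and CPU:

```c
/* Build with: cc -mavx avx_demo.c
 * Demonstrates the non-destructive three-operand add (VADDPS),
 * a broadcast (VBROADCASTSS), and a masked load (VMASKMOVPS). */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {0,1,2,3,4,5,6,7}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    float scale = 2.0f;

    __m256 va = _mm256_loadu_ps(a);
    __m256 vb = _mm256_loadu_ps(b);
    /* VADDPS ymm1, ymm2, ymm3: sources va, vb are not overwritten */
    __m256 vc = _mm256_add_ps(va, vb);
    /* Broadcast one scalar to all 8 lanes */
    vc = _mm256_mul_ps(vc, _mm256_broadcast_ss(&scale));
    _mm256_storeu_ps(c, vc);

    /* Masked load: only lanes whose mask sign bit is set are
     * loaded; the rest read as zero. */
    __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, 0, 0, 0, 0);
    __m256 lo4 = _mm256_maskload_ps(a, mask);
    (void)lo4;

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]); /* all 16s */
    putchar('\n');
    return 0;
}
```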
Execution Cluster
• The scheduler sees a matrix: 3 ports to 3 "stacks" of execution units
  • General-purpose integer (GPR)
  • SIMD (vector) integer
  • SIMD floating point
• Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Diagram: Ports 0, 1, and 5 each reach all three stacks – e.g., Port 0: ALU / VI MUL + DIV / FP MUL; Port 1: ALU / VI ADD / FP ADD; Port 5: ALU + JMP / VI Shuffle / FP Shuffle + FP Boolean + Blend]
Execution Cluster: Solution
• Repurpose existing data paths for dual use
  • SIMD integer and legacy SIMD FP use the legacy stack style
  • Intel® AVX utilizes both 128-bit execution stacks
• Doubles the FLOPs
  • 256-bit multiply + 256-bit add + 256-bit load per clock
[Diagram: the two 128-bit stacks pair up to execute 256-bit AVX operations – Port 0: FP Multiply + FP Blend; Port 1: FP Add; Port 5: FP Shuffle, FP Boolean, FP Blend]
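The peak-FLOPs arithmetic implied by one 256-bit multiply plus one 256-bit add per clock; the core count and clock frequency below are hypothetical:

```c
#include <stdio.h>

/* Peak single-precision FLOPs from "256-bit multiply + 256-bit
 * add per clock": 8 mul + 8 add = 16 FLOPs/cycle/core. The
 * 4-core, 3.0 GHz figures are illustrative assumptions. */
int main(void) {
    int lanes_sp        = 256 / 32;       /* 8 floats per op   */
    int flops_per_cycle = lanes_sp * 2;   /* one mul + one add */
    double ghz = 3.0;                     /* hypothetical clock */
    int cores  = 4;
    printf("peak: %d FLOPs/cycle/core, %.0f GFLOP/s for the chip\n",
           flops_per_cycle, flops_per_cycle * ghz * cores);
    return 0;
}
```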
Memory Cluster
• 32KB 8-way L1 data cache, fill buffers, 256KB L2 cache (MLC), 32 bytes/cycle
• The memory unit can service two memory requests per cycle
  • A 16-byte load and a 16-byte store per cycle
• Goal: maintain the historic bytes/FLOP ratio of SSE for Intel® AVX
Memory Cluster: Solution
• Dual-use the existing connections
  • Make the load/store pipes symmetric
• The memory unit services three data accesses per cycle
  • 2 load requests of up to 16 bytes each AND 1 store of up to 16 bytes
  • An internal sequencer deals with queued requests
• 32KB 8-way L1 data cache, fill buffers, 256KB L2 cache (MLC), 48 bytes/cycle
• The second load port is one of the highest-impact performance features
  • Required to keep Intel® AVX fed
  • Linear power/performance
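A quick check that the second load port preserves the SSE bytes-per-FLOP ratio for AVX, as the goal above states; the FLOP counts assume one multiply plus one add per cycle in each ISA:

```c
#include <stdio.h>

/* Bytes-per-FLOP check. SSE: one 16 B load per cycle against
 * 8 SP FLOPs (4 mul + 4 add). AVX with the second load port:
 * two 16 B loads against 16 SP FLOPs. */
int main(void) {
    double sse = 16.0 / 8.0;   /* one 16B load, 8 FLOPs/cycle    */
    double avx = 32.0 / 16.0;  /* two 16B loads, 16 FLOPs/cycle  */
    printf("SSE: %.1f B/FLOP, AVX: %.1f B/FLOP\n", sse, avx);
    return 0;                  /* same ratio: the goal is met    */
}
```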
Putting It Together: Sandy Bridge Microarchitecture
[Full pipeline diagram: branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, and the 1.5K-uop decoded uop cache feed allocate/rename/retire (reorder buffers, load/store buffers, zeroing idioms); the scheduler dispatches to Port 0 (ALU, VI MUL, DIV, AVX FP MUL, AVX FP Blend), Port 1 (ALU, VI ADD, AVX FP ADD), Port 5 (ALU, JMP, VI Shuffle, AVX/FP Shuffle, AVX/FP Boolean, AVX FP Blend), Ports 2 and 3 (load / store address), and Port 4 (store data); memory control, the 32KB L1 data cache, fill buffers, and the L2 cache (MLC) at 48 bytes/cycle]
Other Architectural Extensions
• Cryptography instruction throughput enhancements
  • Increased throughput for the AES instructions
• Arithmetic throughput enhancements
  • ADC (add with carry) throughput doubled
  • Faster multiply (64-bit multiplicands with a 128-bit product)
  • ~25% speedup on existing RSA binaries
• State save/restore enhancements
  • New state added for Intel® AVX
  • Hardware monitors which features an application uses
  • Only the state that is actually used is saved/restored
2nd Gen Intel® Core™ Microarchitecture
System Agent, ring architecture, and other innovations in the 2nd generation Intel® Core™ microarchitecture, formerly codenamed Sandy Bridge
[Chip diagram: four cores with LLC slices and processor graphics on the ring; the System Agent integrates the memory controller (2ch DDR3), display (notebook DP port), x16 PCI Express*, DMI to the 2011 PCH, and the PECI interface to the embedded controller]
Integration: Optimization Opportunities
• Dynamically redistribute power between cores and graphics
• Tight power-management control of all components, providing better granularity and deeper idle/sleep states
  • Three separate power/frequency domains: System Agent (fixed), cores + ring, graphics (variable)
• High-bandwidth Last Level Cache, shared among cores and graphics
  • Significant performance boost; saves memory bandwidth and power
• Integrated memory controller and PCI Express ports
  • Tightly integrated with the core/graphics/LLC domain
  • Provides low latency and low power – removes intermediate buses
  • Bandwidth is balanced across the whole machine, from cores/graphics all the way to the memory controller
• Modular microarchitecture for optimal cost/power/performance
  • Derivative products done with minimal effort/time
Scalable Ring On-Die Interconnect
• Ring-based interconnect between the cores, graphics, Last Level Cache (LLC), and the System Agent domain
• Composed of 4 rings
  • A 32-byte data ring, a request ring, an acknowledge ring, and a snoop ring
• Fully pipelined at core frequency/voltage: bandwidth, latency, and power scale with the number of cores
• The massive ring wire routing runs over the LLC with no area impact
• An access on the ring always picks the shortest path to minimize latency (see the sketch below)
• Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
• Scalable to servers with large numbers of cores
High bandwidth, low latency, modular
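A toy model of the shortest-path routing decision mentioned above; the stop count and the routing code are illustrative, not Intel's implementation:

```c
#include <stdio.h>

/* Shortest-path hop count on a bidirectional ring with n stops:
 * go clockwise or counter-clockwise, whichever is shorter. */
static int ring_hops(int from, int to, int n_stops) {
    int cw  = (to - from + n_stops) % n_stops; /* clockwise hops */
    int ccw = n_stops - cw;                    /* the other way  */
    return cw < ccw ? cw : ccw;
}

int main(void) {
    /* e.g., 6 stops: 4 core/LLC slices, graphics, System Agent */
    printf("hops from stop 0 to stop 4: %d\n",
           ring_hops(0, 4, 6));                /* 2, not 4 */
    return 0;
}
```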
Cache Box
• Interface block
  • Between core/graphics/media and the ring
  • Between the cache controller and the ring
• Implements the ring logic, arbitration, and the cache controller
  • A full cache pipeline in each cache box
• Communicates with the System Agent for LLC misses, external snoops, and non-cacheable accesses
• Physical addresses are hashed at the source to prevent hot spots and increase bandwidth (a toy hash is sketched below)
• Each cache box maintains coherency and ordering for the addresses mapped to it
• The LLC is fully inclusive, with "Core Valid Bits" – eliminates unnecessary snoops to the cores
• Runs at core voltage/frequency; scales with the number of cores
Distributed coherency & ordering; scalable bandwidth, latency & power
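A hypothetical address-to-slice hash to show the idea; Intel's actual LLC hash function is undocumented, so the XOR-fold below is purely illustrative:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical hash mapping a physical address to an LLC slice
 * (cache box). The real function is undocumented; this XOR-fold
 * only shows how hashing spreads addresses across slices. */
static unsigned llc_slice(uint64_t phys_addr, unsigned n_slices) {
    uint64_t line = phys_addr >> 6;     /* drop 64B line offset */
    unsigned h = 0;
    while (line) { h ^= (unsigned)(line & 0xFF); line >>= 8; }
    return h % n_slices;
}

int main(void) {
    for (uint64_t a = 0; a < 4; a++)    /* four consecutive lines */
        printf("line %llu -> slice %u\n",
               (unsigned long long)a, llc_slice(a << 6, 4));
    return 0;
}
```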
LLC Sharing
• The LLC is shared among all cores, graphics, and media
  • The graphics driver controls which streams are cached/coherent
  • Any agent can access all data in the LLC, independent of who allocated the line (after memory-range checks)
  • A controlled LLC way-allocation mechanism prevents thrashing between cores and graphics
• Multiple coherency domains
  • IA domain (fully coherent via cross-snoops)
  • Graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine)
  • Non-coherent domain (display data, flushed to memory by the graphics engine)
Much higher graphics performance, DRAM power savings, and more DRAM bandwidth available to the cores
System Agent
• Contains PCI Express*, DMI, the memory controller, the display engine…
• Contains the Power Control Unit
  • A programmable microcontroller that handles all power-management and reset functions in the chip
• Smart integration with the ring
  • Provides the cores/graphics/media with high-bandwidth, low-latency access to DRAM/IO for best performance
  • Handles IO-to-cache coherency
• Separate voltage and frequency from the ring/cores; display integration for better battery life
• Extensive power and thermal management for PCI Express and DDR
Smart I/O integration
Thread-Level Parallelism
• Multiprocessor systems have been used for many years
  • There are known techniques to exploit multiprocessors
• Software trends
  • Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
• Thread-level parallelism (TLP) – threads can come from
  • the same application
  • different applications running simultaneously
  • operating-system services
• Increasing single-thread performance becomes harder
  • and is less and less power efficient
• Chip Multi-Processing (CMP)
  • Two (or more) processors are put on a single die
Multi-Threading
• Multi-threading: a single processor executes multiple threads
• Time-slice multithreading
  • The processor switches between software threads after a fixed time period
  • Can effectively minimize the effects of long memory latencies
• Switch-on-event multithreading
  • Switches threads on long-latency events such as cache misses
  • Works well for server applications that have many cache misses
• A deficiency of both time-slice MT and switch-on-event MT
  • They do not cover for branch mispredictions and long dependencies
• Simultaneous multi-threading (SMT)
  • Multiple threads execute on a single processor simultaneously without switching
  • Makes the most effective use of processor resources
  • Maximizes performance vs. transistor count and power
Hyper-Threading (HT) Technology
• HT is Intel's implementation of SMT
  • Makes a single physical processor appear as 2 logical processors (threads)
• Each thread keeps its own architectural state
  • General-purpose registers
  • Control and machine-state registers
• Each thread has its own interrupt controller
  • Interrupts sent to a specific logical processor are handled only by it
• The OS views logical processors (threads) as physical processors
  • It schedules threads to logical processors as in a multiprocessor system
• From a micro-architecture perspective
  • Threads share a single set of physical resources
  • caches, execution units, branch predictors, control logic, and buses
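One concrete consequence: software can see the HT capability via CPUID. A minimal GCC/Clang sketch for x86; note that the HTT flag only indicates that the package can expose multiple logical processors, so it does not by itself prove HT is enabled:

```c
/* GCC/Clang on x86: check the CPUID "HTT" flag (leaf 1, EDX
 * bit 28). OS APIs report the actual logical CPU count. */
#include <cpuid.h>
#include <stdio.h>

int main(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;                         /* leaf 1 unsupported */
    printf("HTT flag: %s\n", (edx & (1u << 28)) ? "set" : "clear");
    return 0;
}
```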
Two Important Goals
• When one thread is stalled, the other thread can continue to make progress
  • Independent progress is ensured by either
    • Partitioning buffering queues and limiting the number of entries each thread can use
    • Duplicating buffering queues
• A single active thread running on a processor with HT runs at the same speed as without HT
  • Partitioned resources are recombined when only one thread is active
Front End
• Each thread manages its own next-instruction pointer
• The threads arbitrate for uop-cache (UC) access every cycle (ping-pong)
  • If both want to access the UC, access is granted in alternating cycles
  • If one thread is stalled, the other thread gets the full UC bandwidth
• UC entries are tagged with a thread ID
  • Dynamically allocated as needed
  • Allows one logical processor to have more entries than the other
Front End (cont.)
• Branch-prediction structures are either duplicated or shared
  • The return stack buffer is duplicated
  • Global history is tracked per thread
  • The large global-history array is shared
    • Entries are tagged with a logical-processor ID
• Each thread has its own ITLB
• Both threads share the same decoder logic
  • If only one thread needs the decode logic, it gets the full decode bandwidth
  • The state needed by the decoders is duplicated
• The uop queue is hard partitioned
  • Allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., a uop-cache miss) or execution stalls
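A toy model of the per-cycle ping-pong arbitration described above, including the rule that a stalled thread forfeits its slot; the stall schedule is invented for the demonstration:

```c
#include <stdio.h>

/* Ping-pong arbitration sketch: alternate between threads when
 * both are ready; otherwise give the full bandwidth to the
 * thread that is not stalled. */
static int pick_thread(int t0_ready, int t1_ready, int *last) {
    if (t0_ready && t1_ready) { *last ^= 1; return *last; }
    if (t0_ready) return 0;
    if (t1_ready) return 1;
    return -1;                          /* nobody to fetch for */
}

int main(void) {
    int last = 1;                       /* so thread 0 goes first */
    for (int cycle = 0; cycle < 4; cycle++) {
        int t1_stalled = (cycle >= 2);  /* pretend thread 1 stalls */
        int who = pick_thread(1, !t1_stalled, &last);
        printf("cycle %d: thread %d\n", cycle, who);
    }
    return 0;                           /* 0, 1, then 0, 0 */
}
```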
Out-of-Order Execution
• The ROB and MOB are hard partitioned
  • Enforces fairness and prevents deadlocks
• The allocator ping-pongs between the threads
• A thread is selected for allocation if
  • Its uop queue is not empty
  • Its buffers (ROB, RS) are not full
  • It is the thread's turn, or the other thread cannot be selected
Out-of-Order Execution (cont.)
• Registers are renamed to a shared physical register pool
  • Results are stored there until retirement
• After allocation and renaming, uops are placed in one of 2 queues
  • A memory-instruction queue and a general-instruction queue
  • The two queues are hard partitioned
• Uops are read from the queues and sent to the schedulers using ping-pong
• The schedulers are oblivious to threads
  • They schedule uops based on dependencies and execution-resource availability, regardless of thread
  • Uops from the two threads can be dispatched in the same cycle
  • To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler's queue is limited
• The forwarding logic compares physical register numbers
  • Results are forwarded to other uops without thread knowledge
Out-of-Order Execution (cont.)
• Memory is largely thread-oblivious
  • The L1 data cache, L2 cache, and L3 cache are thread-oblivious
  • All use physical addresses
• The DTLB is shared
  • Each DTLB entry includes a thread ID as part of the tag
• Retirement ping-pongs between the threads
  • If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread
Single-Task and Multi-Task Modes
• MT mode (multi-task mode)
  • Two active threads, with some resources partitioned as described earlier
• ST mode (single-task mode)
  • There are two flavors of ST mode
    • single-task thread 0 (ST0) – only thread 0 is active
    • single-task thread 1 (ST1) – only thread 1 is active
  • Resources that were partitioned in MT mode are recombined to give the single active logical processor use of all of the resources
• Moving between modes
  • When one thread executes HALT in MT mode, the processor drops to ST0 or ST1; when the remaining thread also executes HALT, the processor enters a low-power state; an interrupt to a halted thread moves the processor back up
Operating System and Applications
• An HT processor appears to the OS and application software as 2 processors
  • The OS manages logical processors as it does physical processors
• The OS should implement two optimizations:
• Use HALT if only one logical processor is active
  • Allows the processor to transition to either ST0 or ST1 mode
  • Otherwise the OS would run its "idle loop" on the idle logical processor: a sequence of instructions that repeatedly checks for work to do
  • This idle loop can consume significant execution resources that could otherwise be used by the other active logical processor
• On a multi-processor system
  • Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
  • Allows software threads to use different physical resources when possible