
Computer Architecture Advanced Topics



Presentation Transcript


  1. Computer Architecture Advanced Topics

  2. Pentium® M Processor

  3. From Pentium® M Processor • Intel’s 1st processor designed for mobility • Achieves the best performance at given power and thermal constraints • Achieves the longest battery life src: http://www.anandtech.com

  4. Example: Sandy Bridge • Use Moore’s Law and process improvements to: • Improve power/performance • Increase integration • Reduce communication • Reduce latencies • (at a cost in complexity) • More performance and efficiency via: • SpeedStep • Memory hierarchy • Multi-core • Multi-threading • Out-of-order execution • Predictors • Multi-operand (vector) instructions • Custom processing src: http://www.anandtech.com

  5. Performance per Watt • Mobile’s smaller form factor decreases the power budget • Power generates heat, which must be dissipated to keep transistors within the allowed temperature • Limits the processor’s peak power consumption • Change the target • Old target: get max performance • New target: get max performance at a given power envelope • Performance per Watt • Performance via frequency increase • Power = C·V²·f, but increasing f also requires increasing V • X% performance costs 3X% power • Assume performance is linear with frequency • A power-efficient feature does better than 1:3 performance:power • Otherwise it is better to just increase frequency • All Banias u-arch features (aimed at performance) are power efficient
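The 1:3 rule on this slide follows directly from the dynamic-power relation. A minimal sketch (the function names and the simplifying assumption that voltage must scale linearly with frequency are mine, not the slide’s):

```python
# Sketch, not Intel's model: dynamic power P = C * V^2 * f, with the
# simplifying assumption that V scales linearly with f.
def dynamic_power(c, v, f):
    """Dynamic (switching) power of a CMOS circuit."""
    return c * v * v * f

def relative_power(freq_scale):
    """Power cost of scaling frequency by freq_scale.

    With V scaling alongside f, P ~ V^2 * f ~ f^3, so small
    frequency (~performance) gains are disproportionately expensive.
    """
    return freq_scale ** 3

# A 10% frequency gain costs roughly 33% more power -- the 1:3
# performance:power ratio that any power-efficient feature must beat.
gain = relative_power(1.10) - 1.0   # ~0.33
```

This is why the slide sets 1:3 as the bar: a feature that buys X% performance for less than 3X% power beats simply raising the clock.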

  6. Higher Performance vs. Longer Battery Life: Yesterday’s Numbers [platform power breakdown chart: Display (panel + inverter) 33%, CPU 10%, Power Supply 10%, Intel® MCH 9%, HDD 8%, GFX 8%, Misc. 8%, CLK 5%, Intel® ICH 3%, Fan 2%, LAN 2%, DVD 2%] • Processor average power is <10% of the platform • The processor reduces power in periods of low processor activity • The processor enters lower power states in idle periods • Average power includes low-activity periods and idle time • Typical: 1W – 3W • Max power limited by heat dissipation • Typical: 20W – 100W • Decision • Optimize for performance when active • Optimize for battery life when idle src: http://www.anandtech.com

  7. Higher Performance vs. Longer Battery Life Today • High dynamic range • Long periods of idle with peaks of activity • Minimize power when idle • Adequate performance when active • Quick transitions • Max power limited by heat dissipation • Typical: 3W (cell), 6W (tablet), 15W (small PC), 60W (mainstream PC), 150W+ (desktop) • How can one design fit all? • Decision • Optimize for user experience when active (adequate performance) • Optimize for battery life when idle src: http://www.anandtech.com

  8. Static Power • The power consumed by a processor consists of • Active power: used to switch transistors • Static power: leakage of transistors under voltage • Static power is a function of • Number of transistors and their type • Operating voltage • Die temperature • Leakage is growing dramatically in new process technologies • Pentium® M reduces static power consumption • The L2 cache is built with low-leakage transistors (2/3 of the die transistors) • Low-leakage transistors are slower, increasing cache access latency • The significant power saved justifies the small performance loss • Enhanced SpeedStep® technology • Reduces voltage and temperature on low processor activity

  9. Less is More • Fewer instructions per task • Advanced branch prediction reduces the number of wrong-path instructions executed • SSE instructions reduce the number of instructions architecturally • Fewer uops per instruction • Uop fusion • Dedicated stack engine • Fewer transistor switches per micro-op • Efficient bus • Various lower-level optimizations • Less energy per transistor switch • Enhanced SpeedStep® technology Power-awareness top to bottom

  10. Improved Branch Predictor • Pentium® M employs best-in-class branch prediction • Bimodal predictor, global predictor, loop detector • Indirect branch predictor • Reduces the number of wrong-path instructions executed • Saves energy spent executing wrong-path instructions • Loop predictor • Analyzes branches for loop behavior • Moving in one direction (taken or not-taken) a fixed number of times • Ending with a single movement in the opposite direction • Detects the exact loop count • So the loop is predicted accurately
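The loop-detector behavior described above can be sketched as a toy model (the class, its tables, and its single-pass training policy are hypothetical simplifications, not Intel’s hardware):

```python
# Hypothetical sketch of a loop detector: learn that a branch moves in
# one direction a fixed number of times, then once the other way.
class LoopDetector:
    def __init__(self):
        self.trained = {}   # branch IP -> learned trip count
        self.streak = {}    # branch IP -> current taken streak

    def predict(self, ip):
        """Predict taken while inside the learned trip count, not-taken on exit."""
        if ip not in self.trained:
            return None                     # defer to other predictors
        return self.streak.get(ip, 0) < self.trained[ip]

    def update(self, ip, taken):
        if taken:
            self.streak[ip] = self.streak.get(ip, 0) + 1
        else:
            # Loop exit: remember the exact trip count for next time.
            self.trained[ip] = self.streak.get(ip, 0)
            self.streak[ip] = 0

ld = LoopDetector()
for t in [True] * 3 + [False]:      # first run: 3 taken, then 1 not-taken
    ld.update(0x40, t)

preds = []                          # second run is predicted exactly
for _ in range(4):
    p = ld.predict(0x40)
    preds.append(p)
    ld.update(0x40, p)
# preds == [True, True, True, False] -- including the exit, with no mispredict
```

A plain 2-bit counter would mispredict the loop exit every iteration of the outer code; counting the trip count removes that mispredict entirely.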

  11. Indirect Branch Predictor [diagram: the branch IP indexes the Target Array (TA); the IP hashed with the global history indexes the iTA; on an iTA hit for an indirect branch, the iTA target overrides the TA’s predicted target] • Indirect jumps are widely used in object-oriented code (C++, Java) • Targets are data dependent • Resolved at execution → high misprediction penalty • Initially, allocate an indirect branch only in the target array (TA) • If the TA mispredicts, allocate in the iTA according to global history • Multiple targets can be allocated for a given branch • Indirects with a single target are predicted by the TA, saving iTA space • Use the iTA if the TA indicates an indirect branch and the iTA hits
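The TA/iTA interplay on this slide can be sketched as a toy model (the dictionaries stand in for the real tagged arrays, and the allocation policy is a hypothetical simplification):

```python
# Sketch of the two-level indirect predictor described above.
class IndirectPredictor:
    def __init__(self):
        self.ta = {}     # branch IP -> (last target, is_indirect)
        self.ita = {}    # (branch IP, global history) -> target

    def predict(self, ip, history):
        # Use the iTA only when the TA marks the branch indirect AND the iTA hits.
        if ip in self.ta and self.ta[ip][1] and (ip, history) in self.ita:
            return self.ita[(ip, history)]
        return self.ta[ip][0] if ip in self.ta else None

    def update(self, ip, history, actual, is_indirect=True):
        # Allocate in the iTA only after the TA has mispredicted this branch.
        if ip in self.ta and self.ta[ip][0] != actual and is_indirect:
            self.ita[(ip, history)] = actual
        self.ta[ip] = (actual, is_indirect)

p = IndirectPredictor()
p.update(0x10, 0b0101, 0x100)   # monomorphic branch: stays TA-only
p.update(0x20, 0b0011, 0x200)   # first sighting: TA only
p.update(0x20, 0b1100, 0x300)   # TA mispredicts -> per-history iTA entry
```

Note how the single-target branch at 0x10 never consumes iTA space, while the polymorphic one at 0x20 gets a history-qualified entry only after the TA fails.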

  12. Dedicated Stack Engine • PUSH, POP, CALL, RET update ESP (add or sub an offset) • Use a dedicated add uop • Track the ESP offset at the front-end • ID maintains offset in ESP_delta (+/- Osize) • Eliminates need for uops updating ESP • Patch displacements of stack operations • In some cases, ESP actual value is needed • For example: add eax, esp, 3 • A sync uop is inserted before the instruction • if ESP_delta != 0 • ESP = ESP + ESP_delta • Reset ESP_delta • ESP_delta recovered on jump misprediction

  13. ESP Tracking Example (initially Δ = 0)
  Instruction | Without stack engine            | With stack engine
  PUSH eax    | ESP = ESP - 4; STORE [ESP], EAX | Δ = Δ - 4 (Δ = -4); STORE [ESP-4], EAX
  PUSH ebx    | ESP = ESP - 4; STORE [ESP], EBX | Δ = Δ - 4 (Δ = -8); STORE [ESP-8], EBX
  INC eax     | EAX = ADD EAX, 1                | EAX = ADD EAX, 1 (Δ = -8)
  INC esp     | ESP = ADD ESP, 1                | Sync ESP! ESP = SUB ESP, 8 (Δ = 0); ESP = ADD ESP, 1 (Δ = 0)
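The Δ tracking in this example can be sketched in a few lines (a hypothetical simplification of the front-end logic; the 4-byte PUSH width and the uop strings are illustrative):

```python
# Sketch of front-end ESP tracking: the decoder folds PUSH offsets into a
# delta, patches store displacements, and inserts a sync uop only when an
# instruction reads the architectural ESP.
class StackEngine:
    def __init__(self):
        self.delta = 0          # ESP_delta maintained at decode

    def push(self):
        """PUSH: fold ESP -= 4 into the delta; the store uses [ESP + delta]."""
        self.delta -= 4
        return f"STORE [ESP{self.delta:+d}]"

    def read_esp(self):
        """An instruction needs the real ESP (e.g. INC esp): emit a sync uop."""
        uops = []
        if self.delta != 0:
            uops.append(f"ESP = ADD ESP, {self.delta}")   # sync uop
            self.delta = 0
        return uops

se = StackEngine()
trace = [se.push(), se.push()]   # two PUSHes emit no ESP-update uops at all
trace += se.read_esp()           # INC esp forces one sync uop
# trace == ['STORE [ESP-4]', 'STORE [ESP-8]', 'ESP = ADD ESP, -8']
```

Two ESP-update uops have been eliminated and replaced by a single sync uop, mirroring the table above.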

  14. Uop Fusion • The instruction decoder breaks an instruction into uops • A conventional uop consists of a single operation operating on two sources • An instruction requires multiple uops when • The instruction operates on more than two sources, or • The nature of the operation requires a sequence of operations • Uop fusion: in some cases the decoder fuses 2 uops into one uop • A short field added to the uop supports fusing of specific uop pairs • Uop fusion reduces the number of uops by 10% • Increases performance by effectively widening rename and retire bandwidth • More instructions can be decoded by all decoders • The same task is accomplished by processing fewer uops • Decreases the energy required to complete a given task

  15. A 2-uop Load-Op • A load-op with 3 register operands: add eax, [ebp+4*esi+8] • Decoded into 2 uops: • LD reads the data from memory: tmp = load [ebp+4*esi+8] • OP computes reg ← reg op data: eax = eax + tmp • The LD and OP are inherently serial • The scheduler dispatches the LD to the MEU and the OP to the ALU; the OP is dispatched only when the LD completes

  16. A 1-uop Load-Op • add eax, [ebp+4*esi+8] is decoded into 1 fused uop: eax = eax + load [ebp+4*esi+8] • Increases decode bandwidth • The fused uop has a 3rd source: a new field in the uop holds the index register • Increases allocation bandwidth and effective ROB/RS size • Dispatched twice: the LD is dispatched to the cache, then the OP to the ALU after the LD completes • The fused uop retires after both LD and OP complete • Increases retire bandwidth
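The contrast between these two slides can be captured in a toy model (function names and uop tuples are mine; the point is one allocation/retirement slot versus two dispatches):

```python
# Sketch: a fused load-op occupies one slot through rename/ROB/retire,
# but still dispatches as two operations at the execution units.
def decode(inst, fusion=True):
    """Return the uops occupying front-end and ROB slots for a load-op."""
    if fusion:
        return [("LD+OP", inst)]              # one rename/ROB/retire slot
    return [("LD", inst), ("OP", inst)]       # two slots without fusion

def dispatch(uop):
    """A fused uop is dispatched twice: LD first, OP when the data returns."""
    kind, inst = uop
    return ["LD", "OP"] if kind == "LD+OP" else [kind]

inst = "add eax, [ebp+4*esi+8]"
fused = decode(inst)
assert len(fused) == 1                        # half the alloc/retire traffic
assert dispatch(fused[0]) == ["LD", "OP"]     # execution still sees both ops
```

This is why fusion widens effective rename and retire bandwidth without widening the execution core: only the bookkeeping is halved, not the work.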

  17. Enhanced SpeedStep™ Technology • The “Basic” SpeedStep™ Technology had • 2 operating points • A non-transparent switch • The “Enhanced” version provides • Multiple voltage/frequency operating points. The Pentium M processor 1.6GHz operating range: • From 600MHz @ 0.956V • To 1.6GHz @ 1.484V • Transparent switches • Frequent switches • Benefits • Higher power efficiency: 2.7X lower frequency costs ~2X performance but gains >2X energy (6.1X lower power, efficiency ratio = 2.3) • Outstanding battery life • Excellent thermal management
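A first-order C·V²·f check of the two operating points roughly reproduces the slide’s ratios (this is a back-of-envelope model; the slide’s 6.1X and 2.3 are measured values, which differ slightly):

```python
# Back-of-envelope check of the slide's operating points using P = C*V^2*f.
v_hi, f_hi = 1.484, 1.6e9     # top operating point: 1.6GHz @ 1.484V
v_lo, f_lo = 0.956, 0.6e9     # lowest operating point: 600MHz @ 0.956V

power_ratio = (v_hi / v_lo) ** 2 * (f_hi / f_lo)   # ~6.4x more power
perf_ratio = f_hi / f_lo                           # ~2.7x, if perf ~ f

# Energy per task = power * time, and time ~ 1/performance:
energy_ratio = power_ratio / perf_ratio            # ~2.4x more energy

# Running 2.7x slower loses performance but saves >2x energy per task,
# close to the slide's measured efficiency ratio of 2.3.
```

The model slightly overstates the gain because real performance does not fall fully linearly with frequency (the slide quotes only ~2X performance loss for the 2.7X frequency drop).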

  18. Trace Cache (Pentium® 4 Processor)

  19. Trace Cache • Decoding several IA-32 inst/clock at high frequency is difficult • Instructions have a variable length and have many different options • Takes several pipe-stages • Adds to the branch mis-prediction penalty • Trace-cache: cache uops of previously decoded instructions • Decoding is only needed for instructions that miss the TC • The TC is the primary (L1) instruction cache • Holds 12K uops • 8-way set associative with LRU replacement • The TC has its own branch predictor (Trace BTB) • Predicts branches that hit in the TC • Directs where instruction fetching needs to go next in the TC

  20. Traces [diagram: a trace line may be entered by a jump into the line and exited by a jump out of the line] • An instruction cache’s fetch bandwidth is limited to a basic block • It cannot provide instructions across a taken branch in the same cycle • The TC builds traces: program-ordered sequences of uops • Allows the target of a branch to be included in the same TC line as the branch itself • Traces have variable length • Broken into trace lines, six uops per trace line • There can be many trace lines in a single trace
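Packing a trace into six-uop lines can be sketched minimally (a hypothetical simplification; real trace-build rules, such as when a trace is terminated, are more involved):

```python
# Sketch: uops are packed in program order, across taken branches,
# into six-uop trace lines.
LINE_SIZE = 6

def build_trace_lines(uops):
    """Split a program-ordered uop sequence into trace-cache lines."""
    return [uops[i:i + LINE_SIZE] for i in range(0, len(uops), LINE_SIZE)]

# Eight uops spanning a taken jmp still fetch from consecutive trace
# lines, where a conventional I-cache would stop at the taken branch.
trace = ["u0", "u1", "jmp", "t0", "t1", "t2", "t3", "t4"]
lines = build_trace_lines(trace)
# lines == [['u0', 'u1', 'jmp', 't0', 't1', 't2'], ['t3', 't4']]
```

The key property is visible in the first line: the jmp and its target t0 sit in the same line, so both are delivered in one fetch.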

  21. Hyper-Threading Technology (Pentium® 4 Processor) Based on: Hyper-Threading Technology Architecture and Micro-architecture, Intel Technology Journal

  22. Thread-Level Parallelism • Multiprocessor systems have been used for many years • There are known techniques to exploit multiprocessors • Software trends • Applications consist of multiple threads or processes that can be executed in parallel on multiple processors • Thread-level parallelism (TLP) – threads can be from • the same application • different applications running simultaneously • operating system services • Increasing single thread performance becomes harder • and is less and less power efficient • Chip Multi-Processing (CMP) • Two (or more) processors are put on a single die

  23. Multi-Threading • Multi-threading: a single processor executes multiple threads • Time-slice multithreading • The processor switches between software threads after a fixed period • Can effectively minimize the effects of long latencies to memory • Switch-on-event multithreading • Switch threads on long latency events such as cache misses • Works well for server applications that have many cache misses • A deficiency of both time-slice MT and switch-on-event MT • They do not cover for branch mis-predictions and long dependencies • Simultaneous multi-threading (SMT) • Multiple threads execute on a single processor simultaneously w/o switching • Makes the most effective use of processor resources • Maximizes performance vs. transistor count and power

  24. Hyper-Threading (HT) Technology • HT is SMT • Makes a single processor appear as 2 logical processors = threads • Each thread keeps its own architectural state • General-purpose registers • Control and machine state registers • Each thread has its own interrupt controller • Interrupts sent to a specific logical processor are handled only by it • The OS views logical processors (threads) as physical processors • It schedules threads to logical processors as in a multiprocessor system • From a micro-architecture perspective • Threads share a single set of physical resources • Caches, execution units, branch predictors, control logic, and buses

  25. Two Important Goals • When one thread is stalled the other thread can continue to make progress • Independent progress ensured by either • Partitioning buffering queues and limiting the number of entries each thread can use • Duplicating buffering queues • A single active thread running on a processor with HT runs at the same speed as without HT • Partitioned resources are recombined when only one thread is active
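The partition-and-recombine policy behind both goals can be sketched with a toy queue (the size and names are illustrative, not the Pentium 4’s actual structures):

```python
# Sketch: a partitioned queue with ST/MT recombination. In MT mode each
# thread may use at most half the entries, so a stalled thread cannot
# starve the other; in ST mode the single thread gets everything.
class PartitionedQueue:
    def __init__(self, size=16):
        self.size = size
        self.used = {0: 0, 1: 0}
        self.mt_mode = True

    def limit(self):
        return self.size // 2 if self.mt_mode else self.size

    def alloc(self, tid):
        if self.used[tid] >= self.limit():
            return False            # this thread's partition is full
        self.used[tid] += 1
        return True

q = PartitionedQueue()
assert all(q.alloc(0) for _ in range(8))   # thread 0 fills its half...
assert not q.alloc(0)                      # ...and is then capped
assert q.alloc(1)                          # thread 1 still makes progress
q.mt_mode = False                          # ST mode: recombine
assert q.alloc(0)                          # the lone thread uses it all
```

The cap implements the first goal (independent progress); dropping it in ST mode implements the second (full speed with one thread).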

  26. Front End • Each thread manages its own next-instruction-pointer • Threads arbitrate TC access every cycle (ping-pong) • If both want to access the TC, access is granted in alternating cycles • If one thread is stalled (e.g., on a TC miss), the other thread gets the full TC bandwidth • TC entries are tagged with a thread ID • Dynamically allocated as needed • Allows one logical processor to have more entries than the other

  27. Front End (cont.) • Branch prediction structures are either duplicated or shared • The return stack buffer is duplicated • Global history is tracked for each thread • The large global history array is shared • Entries are tagged with a logical processor ID • Each thread has its own ITLB • Both threads share the same decoder logic • If only one thread needs the decode logic, it gets the full decode bandwidth • The state needed by the decoders is duplicated • The uop queue is hard partitioned • Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls

  28. Out-of-order Execution • The ROB and MOB are hard partitioned • Enforces fairness and prevents deadlocks • The allocator ping-pongs between the threads • A thread is selected for allocation if • Its uop queue is not empty • Its buffers (ROB, RS) are not full • It is the thread’s turn, or the other thread cannot be selected
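The selection rule above can be written out directly (a hypothetical helper, not Intel’s logic; inputs are per-thread flags indexed by thread ID):

```python
# Sketch of the allocator's round-robin thread selection described above.
def select_thread(turn, uopq_empty, buffers_full):
    """Pick the thread to allocate for this cycle, or None if both stall.

    turn: the thread whose turn it is this cycle.
    uopq_empty, buffers_full: per-thread status flags, indexed by thread ID.
    """
    def eligible(t):
        return not uopq_empty[t] and not buffers_full[t]

    if eligible(turn):
        return turn
    other = 1 - turn
    return other if eligible(other) else None   # no allocation this cycle

# Thread 0's turn, but its buffers are full -> thread 1 allocates instead.
assert select_thread(0, uopq_empty=[False, False],
                     buffers_full=[True, False]) == 1
# Both threads stalled -> the allocator idles for a cycle.
assert select_thread(0, [True, True], [False, False]) is None
```

The ping-pong keeps allocation fair when both threads are ready, while the fallback keeps the pipeline busy when one is stalled.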

  29. Out-of-order Execution (cont.) • Registers are renamed to a shared physical register pool • Physical registers store results until retirement • After allocation and renaming, uops are placed in one of 2 queues • The memory instruction queue and the general instruction queue • The two queues are hard partitioned • Uops are read from the queues and sent to the schedulers using ping-pong • The schedulers are oblivious to threads • They schedule uops based on dependencies and execution resource availability • Regardless of their thread • Uops from the two threads can be dispatched in the same cycle • To avoid deadlock and ensure fairness • Limit the number of active entries a thread can have in each scheduler’s queue • The forwarding logic compares physical register numbers • Forwards results to other uops without thread knowledge

  30. Out-of-order Execution (cont.) • The memory subsystem is largely thread-oblivious • The L1 Data Cache, L2 Cache, and L3 Cache are thread-oblivious • All use physical addresses • The DTLB is shared • Each DTLB entry includes a thread ID as part of the tag • Retirement ping-pongs between threads • If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread

  31. Single-task and Multi-task Modes [state diagram: MT mode moves to ST0 or ST1 when one thread executes HALT, to a low-power state when both have executed HALT, and returns on an interrupt] • MT-mode (multi-task mode) • Two active threads, with some resources partitioned as described earlier • ST-mode (single-task mode) • There are two flavors of ST-mode • Single-task thread 0 (ST0): only thread 0 is active • Single-task thread 1 (ST1): only thread 1 is active • Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources • The processor moves between modes on HALT instructions and interrupts

  32. Operating System and Applications • An HT processor appears to the OS and application SW as 2 processors • The OS manages logical processors as it does physical processors • The OS should implement two optimizations: • Use HALT if only one logical processor is active • Allows the processor to transition to either the ST0 or ST1 mode • Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do • This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor • On a multi-processor system • Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor • Allows SW threads to use different physical resources when possible
