Unlocking Processor Performance: Limits & Strategies

Lecture on High Performance Processor Architecture (CS05162)
Limits on Instruction-Level Parallelism
An Hong, han@ustc.edu.cn
Fall 2007
University of Science and Technology of China, Department of Computer Science and Technology

Limits to ILP
• Conflicting studies of the amount of available ILP, depending on:
  • Benchmarks (vectorized Fortran FP vs. integer C programs)
  • Hardware sophistication
  • Compiler sophistication
• How much ILP is available using existing mechanisms with increasing HW budgets?
• Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
  • DLPs: Intel MMX, SSE, SSE2; Stream Processors
  • TLPs: IBM Power5 (SMT/CMP)
  • PCAs: RAW, Smart Memory, TRIPS
  • etc.
CS of USTC AN Hong

Overcoming Limits
• Advances in compiler technology, combined with significantly new and different hardware techniques, may be able to overcome the limitations assumed in these studies
• However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future

Limits to ILP
• Initial HW model here; MIPS compilers
• Assumptions for an ideal/perfect machine to start:
  1. Register renaming: infinite virtual registers => all register WAW & WAR hazards are avoided
  2. Branch prediction: perfect; no mispredictions
  3. Jump prediction: all jumps perfectly predicted (returns, case statements)
  • 2 & 3 => no control dependencies; perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis: addresses known & a load can be moved before a store provided the addresses are not equal
  • 1 & 4 eliminate all but RAW dependences
• Also: perfect caches; 1-cycle latency for all instructions (including FP *, /); unlimited instructions issued per clock cycle
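Under these assumptions only true (RAW) dependences remain, so the achievable IPC is the instruction count divided by the length of the longest dependence chain. A minimal Python sketch of that calculation follows; the function name `ideal_ipc` and the `(dest, sources)` instruction encoding are illustrative assumptions, not part of the lecture.

```python
def ideal_ipc(instrs):
    """instrs: list of (dest_register, [source_registers]) in program order.

    Models the ideal machine: unit latency, unlimited issue, perfect renaming,
    so only true (RAW) dependences constrain the schedule.
    """
    ready_cycle = {}  # register -> cycle after which its newest value is available
    last_cycle = 0
    for dest, srcs in instrs:
        # An instruction may execute once all of its producers have finished.
        start = max((ready_cycle.get(s, 0) for s in srcs), default=0)
        finish = start + 1           # 1-cycle latency for every operation
        ready_cycle[dest] = finish   # later readers see only the newest writer
        last_cycle = max(last_cycle, finish)
    return len(instrs) / last_cycle if instrs else 0.0


# Hypothetical six-instruction program; the last line rewrites r1
# (a WAW/WAR hazard that renaming removes).
program = [
    ("r1", []), ("r2", []), ("r3", ["r1", "r2"]),
    ("r4", []), ("r5", ["r3", "r4"]), ("r1", []),
]
print(ideal_ipc(program))  # 2.0
```

The longest RAW chain here is three instructions (r1 -> r3 -> r5), so six instructions complete in three cycles.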

Limits to ILP: HW Model Comparison

Upper Limit to ILP: Ideal Machine
• Metric: instructions per clock (IPC)
• FP: 75 - 150
• Integer: 18 - 60

More Realistic HW: Window Impact
• Change from an infinite window to 2048, 512, 128, 32
• Ideal machine, for comparison: FP: 75 - 150; Integer: 18 - 60
• With limited windows: FP: 9 - 150; Integer: 8 - 63
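The effect of a finite window can be sketched with a small simulator: only the oldest `window` unretired instructions are visible to the scheduler, everything ready inside the window executes in one cycle, and retirement is in program order. This is a simplified Python model under assumed unit latencies and an assumed `(dest, sources)` encoding, not the simulator used in the study.

```python
def windowed_ipc(instrs, window):
    """IPC of a unit-latency machine that sees only `window` instructions."""
    n = len(instrs)
    # Resolve each source to the index of its most recent producer (renaming).
    last_writer, producers = {}, []
    for i, (dest, srcs) in enumerate(instrs):
        producers.append([last_writer[s] for s in srcs if s in last_writer])
        last_writer[dest] = i

    done_cycle = [None] * n  # cycle in which instruction i executed
    head = 0                 # oldest instruction not yet retired
    cycle = 0
    while head < n:
        cycle += 1
        # Only instructions in [head, head + window) can be scheduled.
        for i in range(head, min(head + window, n)):
            if done_cycle[i] is None and all(
                done_cycle[p] is not None and done_cycle[p] < cycle
                for p in producers[i]
            ):
                done_cycle[i] = cycle
        # In-order retirement: the window slides only past finished work.
        while head < n and done_cycle[head] is not None:
            head += 1
    return n / cycle


# Two independent 4-instruction dependence chains, one after the other.
chains = [("a", [])] + [("a", ["a"])] * 3 + [("b", [])] + [("b", ["b"])] * 3
for w in (8, 4, 2):
    print(w, windowed_ipc(chains, w))
```

With a window of 8 both chains overlap fully (IPC 2.0); at 4 the second chain starts late (IPC 1.6); at 2 the chains barely overlap (IPC 8/7).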

More Realistic HW: Branch Impact
• Change from an infinite window to a window of 2048 instructions and a maximum issue of 64 instructions per clock cycle
• Ideal machine, for comparison: FP: 75 - 150; Integer: 18 - 60
• With realistic prediction (IPC): FP: 15 - 45; Integer: 6 - 12
• Schemes compared: Perfect, Tournament, BHT (512), Profile, No prediction

Misprediction Rates
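To make the "BHT (512)" scheme concrete, here is a minimal Python sketch of a 512-entry table of 2-bit saturating counters measuring a misprediction rate. The indexing by `pc % entries`, the initial counter value, and the synthetic loop trace are assumptions chosen for illustration.

```python
def bht_mispredict_rate(trace, entries=512):
    """trace: iterable of (branch_pc, taken) pairs; returns misprediction rate."""
    counters = [1] * entries  # 2-bit counters, start weakly not-taken (assumption)
    misses = total = 0
    for pc, taken in trace:
        idx = pc % entries    # index with low-order PC bits (assumption)
        predicted_taken = counters[idx] >= 2
        misses += predicted_taken != taken
        total += 1
        # Saturating update: toward 3 when taken, toward 0 when not taken.
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return misses / total


# Synthetic loop branch: taken nine times, then not taken, repeated ten times.
loop_trace = [(0x40, i % 10 != 9) for i in range(100)]
print(bht_mispredict_rate(loop_trace))  # 0.11
```

The predictor misses once while warming up and once per loop exit, giving 11 mispredictions in 100 branches.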

More Realistic HW: Renaming Register Impact (N int + N fp)
• Change: 2048-instruction window, 64-instruction issue, 8K 2-level prediction
• IPC: FP: 11 - 45; Integer: 5 - 15
• Renaming registers compared: Infinite, 256, 128, 64, 32, None
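A rough Python sketch of renaming with a finite pool is given below. It is deliberately simplified: registers are never freed, and running out raises an error where a real pipeline would stall. The names `rename` and `p0`, `p1`, ... are hypothetical.

```python
def rename(instrs, num_physical):
    """Rename (dest, [sources]) instructions onto a finite physical pool."""
    free = [f"p{i}" for i in range(num_physical)]
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for dest, srcs in instrs:
        # Sources read the current mapping, so true (RAW) dependences survive.
        # Registers never written in this snippet keep their original name.
        new_srcs = [mapping.get(s, s) for s in srcs]
        if not free:
            raise RuntimeError("out of rename registers")
        # Each write gets a fresh register, removing WAW and WAR hazards.
        mapping[dest] = free.pop(0)
        renamed.append((mapping[dest], new_srcs))
    return renamed


# The third instruction rewrites r1: WAW with the first, WAR with the second.
snippet = [("r1", ["r2"]), ("r3", ["r1"]), ("r1", ["r4"])]
print(rename(snippet, 4))
# [('p0', ['r2']), ('p1', ['p0']), ('p2', ['r4'])]
```

After renaming, the third instruction writes `p2` and no longer conflicts with the first two, so it could execute in parallel with them. With fewer than three physical registers the sketch fails, which mirrors how a small rename pool throttles the window.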

More Realistic HW: Memory Address Alias Impact
• Change: 2048-instruction window, 64-instruction issue, 8K 2-level prediction, 256 renaming registers
• IPC: FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9
• Alias analysis models compared: Perfect; Global/Stack perfect, heap conflicts; Inspec. Assem.; None
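The two extremes of the comparison can be illustrated with a small Python sketch: under perfect disambiguation a load may move above every earlier store to a different address, while with no alias analysis every earlier store is assumed to conflict. The function name and the `(kind, address)` encoding are assumptions, and the intermediate models are not covered.

```python
def loads_free_to_hoist(ops, model="perfect"):
    """ops: list of ("load" | "store", address) in program order.

    Returns the indices of loads that may be moved above all earlier stores.
    """
    hoistable = []
    earlier_store_addrs = set()
    for i, (kind, addr) in enumerate(ops):
        if kind == "store":
            earlier_store_addrs.add(addr)
        elif model == "perfect":
            # Addresses are known exactly: only an equal address blocks the move.
            if addr not in earlier_store_addrs:
                hoistable.append(i)
        else:
            # model == "none": any earlier store is assumed to alias.
            if not earlier_store_addrs:
                hoistable.append(i)
    return hoistable


mem_ops = [("store", 100), ("load", 200), ("store", 200),
           ("load", 100), ("load", 300)]
print(loads_free_to_hoist(mem_ops, "perfect"))  # [1, 4]
print(loads_free_to_hoist(mem_ops, "none"))     # []
```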

Realistic HW: Window Impact
• Configuration: perfect disambiguation (HW), 1K selective prediction, 16-entry return, 64 registers, issue as many as window
• IPC: FP: 8 - 45; Integer: 6 - 12
• Window sizes compared: Infinite, 256, 128, 64, 32, 16, 8, 4

Analysis of the ILP Limit: What Went Wrong?
• Preserving sequential semantics while reordering instructions is hard, especially in hardware
• Limits to reordering:
  • Branches: control-flow limit
  • Loads and stores: data-flow limit

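Analysis of the ILP Limit
[Figure: The Dynamically-Scheduled Superscalar Model. Instructions are sequenced and retired in program order (fixed-size window).]
• Steps for extracting ILP:
  • Build an instruction window and determine control dependences
  • Identify and minimize the data dependences among instructions in the window
  • Schedule instructions for parallel execution
• Software approaches vs. hardware approaches to extracting ILP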

Analysis of the ILP Limit
[Figure: The Dynamically-Scheduled Superscalar Model. Instructions are sequenced and retired in program order (fixed-size window).]
• In-order sequencing establishes the correct data dependences between instructions, which are required to implement the meaning of the program

Analysis of the ILP Limit
[Figure: The Dynamically-Scheduled Superscalar Model. Instructions are sequenced and retired in program order (fixed-size window).]
Issue 1: Not enough ready-to-execute useful instructions, due to two kinds of interruptions:
• At the sequencing end (direct interruptions):
  • Instruction cache misses
  • Branch mispredictions
• At the retirement end (indirect interruptions): long execution latencies
  • FP divide
  • Load (data cache miss)
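The retirement-end interruption can be illustrated with a short Python sketch: a long-latency instruction at the head of a fixed-size window cannot retire, so the window cannot slide forward to expose new work. The function name, the latency values, and the dependence encoding (a list of producer indices per instruction) are illustrative assumptions.

```python
def cycles_to_finish(latencies, deps, window):
    """Cycles needed with in-order retirement and a fixed-size window.

    latencies[i]: execution latency of instruction i in cycles.
    deps[i]: indices (all < i) of the instructions that i depends on.
    """
    n = len(latencies)
    finish = [None] * n  # last cycle of execution for each started instruction
    head = 0             # oldest instruction not yet retired
    cycle = 0
    while head < n:
        cycle += 1
        for i in range(head, min(head + window, n)):
            ready = all(finish[p] is not None and finish[p] < cycle
                        for p in deps[i])
            if finish[i] is None and ready:
                finish[i] = cycle + latencies[i] - 1
        # Retire strictly in order; an unfinished head blocks everything behind it.
        while head < n and finish[head] is not None and finish[head] <= cycle:
            head += 1
    return cycle


no_deps = [[] for _ in range(8)]
miss_first = [10] + [1] * 7  # instruction 0 models a load that misses the cache
print(cycles_to_finish([1] * 8, no_deps, window=4))     # 2
print(cycles_to_finish(miss_first, no_deps, window=4))  # 11
print(cycles_to_finish(miss_first, no_deps, window=8))  # 10
```

With a window of 4, the three instructions behind the miss finish in cycle 1 but then wait; instructions 4 to 7 cannot even enter the window until the miss retires in cycle 10. A window of 8 hides all of that work under the miss.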

Analysis of the ILP Limit
[Figure: The Dynamically-Scheduled Superscalar Model. Instructions are sequenced and retired in program order (fixed-size window).]
Issue 2: Not conducive to high processor utilization, because the sequencing order and the global data-driven order rarely match!
• Execution should take place in global data-driven order (data-flow order),
• but execution order is constrained by sequencing order (control-flow order).

Analysis of the ILP Limit
• ILP improvement is limited by hardware complexity
  • Dynamically reorder instructions to fill multiple execution units
  • Must preserve sequential semantics => requires dependency checking
  • Complexity grows as the product of the number of instructions in flight and the number of execution units
  • Work by Sun, IBM, and Compaq indicates that a superscalar width of about 4 is the current cost vs. performance point

Analysis of the ILP Limit
• Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, probably requires a processor to
  • issue 3 or 4 data memory accesses per cycle,
  • resolve 2 or 3 branches per cycle,
  • rename and access more than 20 registers per cycle, and
  • fetch 12 to 24 instructions per cycle.
• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  • E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

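Analysis of the ILP Limit
• ILP improvement is limited by long-latency events in conventional architectures
  • Short latency: FP divide; branch handling; accesses to the local memory system
  • Long latency: accesses to a remote memory system; synchronization waits of indeterminate latency caused by concurrent operations
• ILP improvement is limited by the parallelism inherent in a single instruction stream
  • SPEC CPU 2000 (int, fp)
  • TPC (OLTP, DSS)
• ILP improvement is limited by the partial ordering within a sequential instruction stream
  • von Neumann computing model vs. dataflow computing model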

Limits to ILP
• Most ILP techniques for increasing performance also increase power consumption
• Multiple-issue processor techniques are all energy inefficient:
  • Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
  • Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); the growing gap between peak and sustained performance => increasing energy per unit of performance
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
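The energy-efficiency question can be made concrete as energy per instruction, i.e. power divided by sustained instruction throughput. The Python sketch below uses made-up numbers purely for illustration; none of the wattage, IPC, or clock figures come from the lecture.

```python
def energy_per_instruction(power_watts, sustained_ipc, clock_hz):
    """Joules spent per completed instruction."""
    return power_watts / (sustained_ipc * clock_hz)


# Hypothetical designs at the same 2 GHz clock: the wider one draws 80% more
# power (it scales with peak issue rate) but sustains only 40% more IPC.
narrow = energy_per_instruction(power_watts=50.0, sustained_ipc=1.0, clock_hz=2e9)
wide = energy_per_instruction(power_watts=90.0, sustained_ipc=1.4, clock_hz=2e9)
print(f"narrow: {narrow * 1e9:.1f} nJ, wide: {wide * 1e9:.1f} nJ")
print("energy efficient" if wide <= narrow else "energy inefficient")
```

Because power grew faster than sustained performance in this assumed case, each instruction costs more energy on the wider design.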

How to Exceed the ILP Limits of This Study?

Performance beyond single-thread ILP
• There can be much higher natural parallelism in some applications (e.g., database or scientific codes)
• Explicit Thread-Level Parallelism or Data-Level Parallelism
• Thread: a process with its own instructions and data
  • A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
  • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Data-Level Parallelism: perform identical operations on data, and lots of data
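The two forms of explicit parallelism differ in structure, which the following Python sketch tries to show. It is illustrative only: the helper names are hypothetical, and CPython threads share one interpreter lock, so this demonstrates program structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(values, factor):
    # Data-level parallelism: one operation applied uniformly to many elements.
    return [v * factor for v in values]


def run_threads(tasks):
    # Thread-level parallelism: each task is an independent instruction stream
    # with its own function and arguments (its own state).
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(fn, *args) for fn, args in tasks]
        return [f.result() for f in futures]


print(scale([1, 2, 3], 2))                                  # [2, 4, 6]
print(run_threads([(sum, ([1, 2, 3],)), (max, ([4, 9, 2],))]))  # [6, 9]
```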
