
Performance Counters on Intel® Core™ 2 Duo Xeon® Processors


Presentation Transcript


  1. Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Intel Software College

  2. Objective
     • At the successful completion of this module, you will be able to
     • Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors
     • Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors

  3. Agenda
     • Core® micro-architecture review
     • Event basics
     • Events identifying Intel® Core™ 2 Duo Xeon® processor bottlenecks
     • Summary

  4. Next Generation Micro-Architecture: Intel® Core™ 2 Duo Processor
     [Diagram: two cores (CPU-0 and CPU-1), each with a 32 KB L1 data cache, a 32 KB L1 instruction cache, and L0/L1 DTLB and PMH; both share a 4 MB L2 cache and the FSB.]

  5. Architecture Block and Instruction Flow
     [Block diagram with Fetch/Decode, Execute, and Retire stages. Labeled blocks: Bus Unit (to L2 cache/memory), 32 KB instruction cache, 32 KB data cache, Next IP, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT), Reservation Stations (RS, 32 entry), Scheduler / Dispatch Ports, SIMD / integer arithmetic ports (with FP Add, FP Div/Mul, Integer Shift/Rotate), Load port, Store Addr port, Store Data port, Memory Order Buffer (MOB), Re-Order Buffer (ROB, 96 entry), IA Register Set.]
     Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc.) are shared between cores.

  6. Agenda
     • Core® micro-architecture review
     • Event basics
     • Events identifying Intel® Core™ 2 Duo Xeon® processor bottlenecks
     • Summary

  7. VTune™ Analyzer Event Basics: Events Versus Samples
     • A performance counter increments on the CPU every time an event occurs
     • A sample of the execution context is recorded every time a performance counter overflows
     • Events = samples * sample after value
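The relation on slide 7 can be illustrated with a minimal Python sketch. The function name and the numbers are hypothetical, chosen only to show the arithmetic; they are not taken from the presentation.

```python
def estimate_event_count(samples: int, sample_after_value: int) -> int:
    """Estimate how many events occurred, given the number of samples
    recorded and the sample-after value (SAV), i.e. the counter
    overflow threshold."""
    return samples * sample_after_value


# Hypothetical run: 1,500 samples collected with a SAV of 2,000,000.
events = estimate_event_count(1_500, 2_000_000)
print(f"Estimated events: {events:,}")
```

Because one sample stands for a full SAV worth of events, the estimate is only as fine-grained as the SAV itself.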

  8. VTune™ Analyzer Event Basics: Retired Versus Non-Retired Events
     • Retired events include only events that occur due to instructions that are committed to the machine state.
     • For example, when measuring the Loads Retired event, a load that occurs on a mispredicted execution path is not counted.
     • Most retired events can also be precise events.
     • No event skid

  9. VTune™ Analyzer Event Basics: Event Skid
     • On Pentium® 4 and Intel Xeon™ processors, an event can appear in the disassembly source view a few lines after the instruction where it actually occurred. This skid is due to interrupt latency.

  10. VTune™ Analyzer Event Basics: Precise Events
     • Do not suffer from event skid
     • Use hardware to record the address where the event occurs
     • Reduce the number of events you can collect at once

  11. VTune™ Analyzer Event Basics: Precise Events (cont.)
     [Screenshots: results with precise events on and off]

  12. VTune™ Analyzer Event Basics: Event Ratios
     • Calculate common processor performance metrics
     • Built in to VTune™ analyzer

  13. VTune™ Analyzer Event Basics: Clockticks and Instructions Retired
     • Clockticks measure CPU cycles
     • Clockticks / processor frequency = time in seconds
     • Instructions retired = the number of instructions committed to the processor state (executed completely)
     • Cycles per instruction (CPI) = clockticks / instructions retired
     High CPI usually indicates opportunities for optimization.
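A short Python sketch of the two ratios on slide 13 follows. The function names and the sample counts are illustrative assumptions, not VTune™ analyzer output.

```python
def elapsed_seconds(clockticks: int, frequency_hz: int) -> float:
    """Time in seconds = clockticks / processor frequency."""
    return clockticks / frequency_hz


def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """CPI = clockticks / instructions retired."""
    return clockticks / instructions_retired


# Hypothetical measurement on a 3 GHz processor.
clockticks = 6_000_000_000
instructions = 4_000_000_000
print(f"time = {elapsed_seconds(clockticks, 3_000_000_000):.2f} s")
print(f"CPI  = {cycles_per_instruction(clockticks, instructions):.2f}")
```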

  14. VTune™ Analyzer Event Basics: Clockticks Versus Non-halted Clockticks
     • Clockticks = halted + non-halted cycles (but no sleep cycles)
     • The clockticks event measures cycles when the physical processor is not in any sleep mode.
     • The non-halted clockticks event measures the cycles in which a logical processor is neither asleep nor halted.
     • If you measure clockticks on a Hyper-Threading Technology-enabled system while running a single-threaded application, you will see many samples around the halt instruction in processor.sys.

  15. Agenda
     • Core® micro-architecture review
     • Event basics
     • Performance tuning for Intel® Core™ 2 Duo Xeon® processors
       • Events for performance
       • Performance optimization methodology
       • X86 cycle accounting
     • Summary

  16. Performance Events along Uop Flow (1)
     [Block diagram of the pipeline as on slide 5, including the Bus Unit path to L2 cache/memory.]

  17. Memory Access
     • Latencies
       • L1 miss that hits L2: ~10 cycles
       • L2 miss, access to memory: ~300 cycles (server/FBD)
       • L2 miss, access to memory: ~165 cycles (desktop/DDR2)
     • Cache Bandwidth
       • Bandwidth to cache: ~8.5 bytes/cycle
     • Memory Bandwidth
       • Desktop: ~6 GB/sec/socket (Linux*)
       • Server: ~3.5 GB/sec/socket
     * Other names and brands may be claimed as the property of others.

  18. Performance Events for the Front End
     Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted
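The bandwidth formula on slide 18 can be sketched in Python as below. This sketch assumes the constant 64 is the cache-line size in bytes moved per bus memory transaction, and that freq is the core frequency in Hz. The function name and the counts are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: bytes transferred per bus memory transaction


def memory_bandwidth_bytes_per_sec(bus_trans_mem: int,
                                   cpu_clk_unhalted: int,
                                   frequency_hz: int) -> float:
    """Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted.

    cpu_clk_unhalted / frequency_hz is the measured run time in seconds,
    so the result is bytes transferred per second."""
    return CACHE_LINE_BYTES * bus_trans_mem * frequency_hz / cpu_clk_unhalted


# Hypothetical counts: 50 million bus memory transactions over
# 3 billion unhalted cycles on a 3 GHz core (about one second).
bw = memory_bandwidth_bytes_per_sec(50_000_000, 3_000_000_000, 3_000_000_000)
print(f"Memory bandwidth ~ {bw / 1e9:.1f} GB/s")
```

The result can be compared against the approximate memory bandwidth figures listed on slide 17 to judge whether a region of code is bandwidth limited.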

  19. Lab Activity 1: Calculating the Memory Access Bandwidth
     • In this lab, you will calculate the memory bandwidth from performance counter events using the VTune™ analyzer

  20. Performance Events along Uop Flow (2)
     [Block diagram of the pipeline as on slide 5. Annotation: Resource_Stalls measures here, at the transfer from Decode.]

  21. Performance Events of Resource_Stalls
     • Uop flow to OOO engine blocked by downstream cause
     • Resource_Stalls.BR_MISS_CLEAR: pipeline stalls due to flushing mispredicted branches
     • Combine in Resource_stalls.CLEAR
     • Mispredicted branch followed by fp inst
     • Resource_Stalls.ROB_FULL: 96 instructions in ROB
     • Resource_Stalls.LD_ST: all Store or Load buffers in use
     • Resource_Stalls.RS_FULL: 32 instructions waiting for inputs in Reservation Station

  22. Measuring Instruction Starvation
     • There really is no good way to do this
     • Anti-correlate with Resource_Stalls.RS_FULL
     • There could be
       • Cycles Decode queue is empty
       • Cycles RS is empty
       • Cycles ROB is empty

  23. Performance Events along Uop Flow (3)
     [Block diagram of the pipeline as on slide 5. Annotations: Rs_uops_dispatched measures at Execution; other stalls measure at Execution.]

  24. Measuring Efficiency in the Execution Stage
     • OOO engine optimizes instruction issue to functional units from Reservation Station
     • Uops wait there until their inputs are available
     • RS_UOPS_DISPATCHED measures number of uops dispatched from RS on each cycle
     • There are chains preventing OOO engine from executing in parallel
       • Partial Register Stall
       • Partial Flag Register Stall
       • Domain bypass
       • Others…

  25. Performance Events along Uop Flow (4)
     [Block diagram of the pipeline as on slide 5. Annotation: Uops_retired measures at Retirement.]

  26. Retirement vs Dispatch
     • Which function to work on first?
     • For loops, difference is due to OOO execution
     • Fewer false positives when “stalls” are measured at dispatch

  27. Performance Optimization Methodology
     • This style of optimization has 2 components
       • Minimizing instruction count (path length)
         • A sort of “tree height” minimization
       • Minimizing deviations from ideal execution
         • Generically thought of as “stall cycles”
     • Treating both equally is critical

  28. Stalls, Execution Imperfection and Performance Analysis
     • Stall cycles are used to indicate less than perfect execution
     • An architectural decomposition of “stalls” can be used to guide the selection of architectural events
     • The IP correlation of “stalls” and arch events then guides the optimization effort
     • Stalls have 4 basic components in x86
       • Front End stalls
         • Execution stage instruction starvation (Front End)
       • Mispredicted branch pipeline flushing
       • Execution stalls
         • (Waiting on input/Scoreboard, L2 miss, BW, DTLB, glass jaws etc.)
       • Cycles wasted executing instructions that are not retired

  29. X86 Cycle Accounting and SW Optimization
     • Cpu_clk_unhalted = “stalls” + dispatch = “stalls” + non_ret_dispatch + ret_dispatch
     [Diagram callouts: Improve optimization to reduce instruction count, split loops to increase ILP; Reduce branch mispredictions, PGO; Traditional stall removal; Resource_stalls.br_miss_clear will estimate stalls due to pipeline flush.]

  30. Cycle Accounting on X86
     • Cycles = “stalls” + dispatch
     • An equality by definition
     • Cycles ~ CPU_CLK_UNHALTED.CORE
     • For CPU-intensive applications/sampling
     • Stall Cycles = Cycles with NO uops Dispatched = RS_UOPS_DISPATCHED.CYCLES_NONE
     • Dispatch Cycles = RS_UOPS_DISPATCHED

  31. Cycle Accounting on X86 (cont.)
     • Dispatch ~ cycles_dispatch_retiring_uops + cycles_dispatch_non_retiring_uops
     • Assumes no overlap of retired/non-retired uops
     • Worst Case Scenario
     • Non retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)
     • Non retired uop cycles ~ non retired uops / avg_uops_per_cycle
     • Fractional Wasted Work = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
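The cycle accounting on slides 30 and 31 can be worked through with a small Python sketch. The field names mirror the events named on the slides; the counts are invented for illustration. One point is an assumption of this sketch rather than something the slides state: avg_uops_per_cycle is taken as uops dispatched divided by dispatch cycles.

```python
from dataclasses import dataclass


@dataclass
class CycleCounts:
    """Raw event totals, named after the events on the slides."""
    cpu_clk_unhalted: int              # Cycles
    rs_uops_dispatched_cycles_none: int  # Stall cycles (no uops dispatched)
    rs_uops_dispatched: int            # Uops dispatched from the RS
    uops_retired_any: int
    uops_retired_fused: int


def account(c: CycleCounts) -> dict:
    stalls = c.rs_uops_dispatched_cycles_none
    dispatch_cycles = c.cpu_clk_unhalted - stalls  # Cycles = stalls + dispatch
    retired = c.uops_retired_any + c.uops_retired_fused
    non_retired_uops = c.rs_uops_dispatched - retired
    # Assumption: average uops per cycle is measured over dispatch cycles only.
    avg_uops_per_cycle = c.rs_uops_dispatched / dispatch_cycles
    non_retired_cycles = non_retired_uops / avg_uops_per_cycle
    return {
        "stall_cycles": stalls,
        "dispatch_cycles": dispatch_cycles,
        "non_retired_uops": non_retired_uops,
        "non_retired_dispatch_cycles": non_retired_cycles,
        "retired_dispatch_cycles": dispatch_cycles - non_retired_cycles,
        "fractional_wasted_work": c.rs_uops_dispatched / retired - 1,
    }


# Hypothetical counts, for illustration only.
result = account(CycleCounts(1_000, 400, 1_200, 900, 100))
for name, value in result.items():
    print(f"{name}: {value}")
```

Splitting dispatch cycles this way follows the "no overlap" worst-case assumption on slide 31, so the non-retired share is an upper estimate.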

  32. Pulling Cycle Accounting Together
     [Chart: illustrative example only, not real data]

  33. Decomposing Stalls: Elephants First
     Pipeline Flush = Resource_Stalls.Br_Miss_Clear / cycles
     L2 Hits = (MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS) * 10 / cycles
     DTLB/L2 Miss = event count * penalty / cycles
     FE + Scoreboard = Stalls - all of the above
     [Chart: illustrative example only, not real data]
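The decomposition on slide 33 can be sketched in Python as follows. All counts are hypothetical. The sketch expresses each term as a fraction of total cycles, takes "Stalls" to be stall cycles divided by cycles, uses the ~165-cycle desktop L2 miss latency from slide 17 as the L2 miss penalty, and leaves out the DTLB term; these are assumptions of the sketch.

```python
def decompose_stalls(cycles: int,
                     stall_cycles: int,
                     br_miss_clear: int,
                     l1d_line_miss: int,
                     l2_line_miss: int,
                     l2_hit_penalty: int = 10,
                     l2_miss_penalty: int = 165) -> dict:
    """Return each stall component as a fraction of total cycles."""
    pipeline_flush = br_miss_clear / cycles
    # Loads that missed L1D but not L2 are L2 hits.
    l2_hits = (l1d_line_miss - l2_line_miss) * l2_hit_penalty / cycles
    l2_miss = l2_line_miss * l2_miss_penalty / cycles
    stalls = stall_cycles / cycles
    return {
        "pipeline_flush": pipeline_flush,
        "l2_hits": l2_hits,
        "l2_miss": l2_miss,
        # Whatever is left over is attributed to front end + scoreboard.
        "fe_plus_scoreboard": stalls - (pipeline_flush + l2_hits + l2_miss),
    }


# Hypothetical counts, for illustration only.
parts = decompose_stalls(cycles=1_000_000, stall_cycles=400_000,
                         br_miss_clear=50_000,
                         l1d_line_miss=10_000, l2_line_miss=1_000)
for name, fraction in parts.items():
    print(f"{name}: {fraction:.3f}")
```

Because the penalties are approximate and the components can overlap, the remainder term can be under- or over-estimated, and it can even go negative on real data.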

  34. Decomposing Unstalled Cycles
     Non_Retired = (1 - (Uops_retired.any + Uops_retired.fused) / RS_Uops_Dispatched) * RS_Uops_Dispatched.Cycles_None / CPU_CLK_UNHALTED.CORE
     OOO Bursts = Uops_Retired.Any - Stalls - Non_Retired
     [Chart: illustrative example only, not real data]

  35. Pulling it All Together
     • Risks over-counting / minimizing FE + Scoreboard
     • But offers a guide to execution inefficiencies
     [Chart: illustrative example only, not real data]

  36. The “Big 4” Events for Performance
     • CPU_CLK_UNHALTED.CORE
     • RS_UOPS_DISPATCHED.CYCLES_NONE
     • MEM_LOAD_RETIRED.L2_LINE_MISS
     • BUS_TRANS_ANY.SELF
     CYCLES, STALLS, UNPREFETCHED LOADS and BANDWIDTH

  37. Architectural Pitfalls: The Ants
     • Contribute to “FE + Scoreboard”
     • And don’t forget Micro-Fusion, Macro-Fusion, etc.

  38. A Heuristic Break-down for Stall Analysis
     [Flow diagram. Labels: Stalls?; the “Big 4 (L2 cache)”, L1D cache; Front End Stalls; Instructions decoding, LCP…; Exe Unit Stalls; Register related, Domain related; Resource Stalls; RS related and RAT related; Retirement Efficiency; And others.]

  39. A Heuristic Break-down for Stall Analysis (cont.)

  40. Lab Activity 2: Using a SW Tool to Reduce the Instruction Count
     • In this lab, you will practice using the Intel compiler vectorization switch to reduce the instruction count.

  41. Lab Activity 3: Addressing the Performance Bottleneck in the Front End
     • In this lab, you will use “Big 4” events analysis to identify and address a performance issue in the Front End of the processor.

  42. Lab Activity 4: Addressing the Performance Bottleneck in the Execution Core
     • In this lab, you will identify and address a performance issue in the execution core of the processor.

  43. A Loop Methodology
     • Identify hot functions and raise optimization
     • Fix alignments, split loops to enhance vectorization
     • Identify BW limited functions
     • Merge BW loops with FP limited loops
     • Identify L2 misses and add sw prefetch
     • Optimize flow through OOO Engine
     • Use loop splitting to assist here

  44. More Detailed Event Selection Hierarchy
     SAV values selected so ratio of samples ~ absorbs penalty

  45. More Detailed Event Selection Hierarchy (cont.)
     SAV values selected so ratio of samples ~ absorbs penalty
     EX: L1 miss/L2_hit penalty is 10 cycles
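One way to read the SAV note on slides 44 and 45 is sketched below in Python. This is an interpretation, not something the slides spell out: if an event's SAV is the cycle SAV divided by the event's penalty, then the ratio of event samples to cycle samples approximates the fraction of cycles lost to that event. The function names and numbers are hypothetical.

```python
def sav_for_event(cycles_sav: int, penalty_cycles: int) -> int:
    """Pick an event SAV so that one event sample represents roughly
    the same number of lost cycles as one clocktick sample."""
    return cycles_sav // penalty_cycles


def sample_ratio(event_count: int, event_sav: int,
                 cycle_count: int, cycles_sav: int) -> float:
    """Ratio of event samples to cycle samples (idealized, fractional)."""
    return (event_count / event_sav) / (cycle_count / cycles_sav)


# Hypothetical setup: cycle SAV of 2,000,000 and a 10-cycle L2-hit penalty.
cycles_sav = 2_000_000
event_sav = sav_for_event(cycles_sav, penalty_cycles=10)

# 9 million L2 hits over 1 billion cycles: 9e6 * 10 / 1e9 = 9% of cycles.
ratio = sample_ratio(9_000_000, event_sav, 1_000_000_000, cycles_sav)
print(f"event SAV = {event_sav:,}, sample ratio = {ratio:.2f}")
```

With SAVs chosen this way, sample counts in the analyzer can be compared directly across events without multiplying by the penalty each time.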

  46. Summary
     • Utilize Core™ micro-architecture for software performance
       • Front end
       • OOO execution core
     • Use the VTune™ analyzer to identify micro-architectural bottlenecks in your software.
     • Use a cycle accounting methodology to improve the performance.

  47. Performance Counters on Intel® Core™ 2 Duo Xeon® Processors

  48. Micro-Architecture Comparison
     [Comparison table not preserved in the transcript]
     ++ Cedar Mill/Dempsey
     ** NGMA = Next Generation Micro-Architecture (Conroe/Woodcrest)
     = per core

  49. Execution Unit Comparisons
     [Diagram comparing the execution ports of NGMA and the Intel NetBurst® Micro-Architecture. Labels include: Port 0, Port 1, Port 2, Port 4, Port 5, FP Add, FP Div/Mul, FP Add/Mul/Div, SIMD, Integer Arithmetic, Integer Shift/Rotate, Integer Multiply, Load, Store, 2x Core Freq.]

  50. DTLB Structure
     [Chart series: L1 $ Hit, L1DTLB Hit; L1 $ Hit, L1DTLB Miss; L2 $ Hit, L1DTLB Miss]
     Disclaimer: Data is from a pointer chasing microbenchmark and for illustrative purposes only
