
CS 7960-4 Lecture 23

CS 7960-4 Lecture 23. Processor Case Studies, The Microarchitecture of the Pentium 4 Processor, G. Hinton et al., Intel Technology Journal Q1, 2001. Quick Facts. November 2000: Willamette, 0.18 µm, Al interconnect, 42M transistors, 217 mm², 55W, 1.5GHz


Presentation Transcript


  1. CS 7960-4 Lecture 23 • Processor Case Studies • The Microarchitecture of the Pentium 4 Processor, G. Hinton et al., Intel Technology Journal Q1, 2001

  2. Quick Facts • November 2000: Willamette, 0.18 µm, Al interconnect, 42M transistors, 217 mm², 55W, 1.5GHz • February 2004: Prescott, 0.09 µm, Cu interconnect, 125M transistors, 112 mm², 103W, 3.4GHz

  3. Clock Frequencies • Aggressive clocks => little work per pipeline stage => deep pipelines => low IPC, large buffers, high power, high complexity, low efficiency • 50% increase in clock speed => 30% increase in performance • [Figure: two pipeline diagrams, labeled "Mispredict latency = 10 cyc" and "Mispredict latency = 20 cyc"]
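The 50%/30% figures on this slide imply an IPC loss, and the two mispredict latencies hint at one reason for it. The sketch below works through both, assuming performance is proportional to IPC times clock frequency for a fixed instruction count. The function names, the base CPI, and the mispredict rate are illustrative assumptions, not values from the lecture.

```python
def relative_ipc(clock_gain, perf_gain):
    """IPC of the faster-clocked design relative to the baseline.

    Assumes performance ~ IPC * frequency for a fixed instruction count.
    """
    return (1.0 + perf_gain) / (1.0 + clock_gain)


def cpi_with_mispredicts(base_cpi, mispredicts_per_instr, penalty_cycles):
    """Simple additive model: every mispredict costs a fixed number of cycles."""
    return base_cpi + mispredicts_per_instr * penalty_cycles


# Figures from the slide: +50% clock yields +30% performance.
ratio = relative_ipc(clock_gain=0.50, perf_gain=0.30)
print(f"IPC relative to baseline: {ratio:.3f}")

# Hypothetical workload parameters, chosen only for illustration.
BASE_CPI = 1.0
MISPREDICTS_PER_INSTR = 0.005

for penalty in (10, 20):  # the two mispredict latencies on the slide
    cpi = cpi_with_mispredicts(BASE_CPI, MISPREDICTS_PER_INSTR, penalty)
    print(f"penalty={penalty} cycles -> CPI={cpi:.3f}")
```

Under these assumptions the deeper pipeline retains roughly 87% of the baseline IPC, and doubling the mispredict penalty doubles the cycles lost to branches.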

  4. Variable Clocks • The fastest clock is defined as the time for an ALU operation and bypass (twice the main processor clock) • Different parts of the chip operate at slower clocks to simplify the pipeline design (e.g. RAMs)

  5. Microarchitecture Overview

  6. Front End • ITLB, RAS, decoder (note: no I-cache) • Trace Cache: contains 12K µops (~8K-16KB I-cache), saves 3 pipe stages, reduces power • Front-end BTB accessed on a trace cache miss, and a smaller trace-cache BTB to detect the next trace line (no details on the branch prediction algorithm) • Microcode ROM: implements µop translation for complex instructions

  7. Execution Engine • Allocator: resource (regs, IQ, LSQ, ROB) manager • Rename: 8 logical regs are renamed to 128 phys regs; ROB (126 entries) only stores pointers (Pentium 4) and not the actual reg values (unlike P6), giving a simpler design and less power • Two queues (memory and non-memory) and multiple schedulers (select logic); can issue six instrs/cycle
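The rename step above can be illustrated with a generic map-table-plus-free-list scheme. This is a minimal sketch and not the Pentium 4's actual design: only the 8 logical and 128 physical register counts come from the slide, while the class name, method names, and the retire policy are assumptions made for illustration.

```python
from collections import deque

NUM_LOGICAL = 8      # from the slide
NUM_PHYSICAL = 128   # from the slide


class RenameTable:
    """Hypothetical rename stage: a map table plus a free list of physical regs."""

    def __init__(self):
        # Initially, logical register i is held in physical register i.
        self.map = list(range(NUM_LOGICAL))
        self.free = deque(range(NUM_LOGICAL, NUM_PHYSICAL))

    def rename(self, dst, srcs):
        """Rename one instruction.

        Returns (phys_dst, phys_srcs, old_phys_dst). Only these pointers
        need to be tracked until retirement, not the register values.
        """
        # Sources are looked up before the destination mapping is changed.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: allocation must stall")
        phys_dst = self.free.popleft()
        old_phys_dst = self.map[dst]
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs, old_phys_dst

    def retire(self, old_phys_dst):
        """At retirement the previous mapping of the destination is recycled."""
        self.free.append(old_phys_dst)


table = RenameTable()
first = table.rename(dst=0, srcs=[1, 2])   # r0 = r1 op r2
second = table.rename(dst=0, srcs=[0, 3])  # r0 = r0 op r3, reads the renamed r0
print(first, second)
```

The second instruction reads the physical register allocated by the first, so the true dependence is preserved while the name reuse of r0 no longer serializes execution.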

  8. Schedulers • Register porting, multiple queues • 3GHz clock speed = time for a 16-bit add and bypass

  9. NetBurst • 3GHz ALU clock = time for a 16-bit add and bypass to itself (area is kept to a minimum) • Used by 60-70% of all µops in integer programs • Staggered addition: speeds up execution of dependent instrs; an add takes three cycles • Early computation of lower 16 bits => early initiation of cache access
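The staggered addition described above can be sketched as a 32-bit add split into two 16-bit halves, with the low half (and its carry) produced first. This is a functional illustration only, assuming 32-bit operands; the function name is hypothetical, and the third cycle mentioned on the slide is not modeled.

```python
MASK16 = 0xFFFF


def staggered_add32(a, b):
    """Add two 32-bit values as two 16-bit halves.

    Returns (low_half, full_sum). The low half is what would be available
    after the first fast cycle; the full sum wraps modulo 2**32.
    """
    # First fast cycle: low 16 bits and the carry out of bit 15.
    low = (a & MASK16) + (b & MASK16)
    carry = low >> 16
    low &= MASK16

    # Next fast cycle: high 16 bits, consuming the carry from the low half.
    high = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    high &= MASK16

    return low, (high << 16) | low


low, total = staggered_add32(0x0001FFFF, 0x00000001)
print(hex(low), hex(total))
```

A dependent add only needs the producer's low half to start its own low half, which is why dependent instructions can be issued back to back; the same early low bits can start a cache index lookup.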

  10. Data Cache • 4-way 8KB cache; 2-cycle load-use latency for integer instrs and 6-cycle latency for fp instrs • Distance between load scheduler and execution is longer than load latency • Speculative issue of load-dependent instrs and selective replay • Store buffer (24 entries) to forward results to loads (48 entries); no details on load issue algo
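Store-to-load forwarding, mentioned in the last bullet, can be sketched as a small FIFO that loads search before going to memory. The slide gives no details of the real mechanism, so this is a generic model under simplifying assumptions: exact address matches only, whole-word accesses, and every buffered store is older than the load. The class and method names are hypothetical; only the 24-entry capacity comes from the slide.

```python
from collections import deque

STORE_BUFFER_ENTRIES = 24  # from the slide


class StoreBuffer:
    """Hypothetical store buffer holding stores that have not yet reached memory."""

    def __init__(self, capacity=STORE_BUFFER_ENTRIES):
        self.capacity = capacity
        self.entries = deque()  # oldest first, each entry is (address, value)

    def store(self, address, value):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("store buffer full: allocation must stall")
        self.entries.append((address, value))

    def load(self, address, memory):
        # Search youngest to oldest so the load sees the most recent store.
        for addr, value in reversed(self.entries):
            if addr == address:
                return value  # forwarded from the store buffer
        return memory.get(address, 0)  # no match: read memory

    def drain_oldest(self, memory):
        """Write the oldest buffered store to memory, in program order."""
        addr, value = self.entries.popleft()
        memory[addr] = value


memory = {0x100: 1}
buf = StoreBuffer()
buf.store(0x100, 7)
buf.store(0x100, 9)
print(buf.load(0x100, memory), buf.load(0x200, memory))
```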

  11. Cache Hierarchy • 256KB 8-way L2; 7-cycle latency; new operation every two cycles • Stream prefetcher from memory to L2, stays 256 bytes ahead • 3.2GB/s system bus: 64-bit wide bus at 400MHz
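The system bus figure follows directly from width times transfer rate. The short check below reproduces it, assuming the slide's 400MHz means 400 million transfers per second and that GB means 10^9 bytes; the function name is illustrative.

```python
def bus_bandwidth_bytes_per_s(width_bits, transfers_per_s):
    """Peak bandwidth of a bus: bytes per transfer times transfers per second."""
    return (width_bits / 8) * transfers_per_s


# Figures from the slide: 64-bit bus, 400 million transfers per second.
bandwidth = bus_bandwidth_bytes_per_s(64, 400e6)
print(f"{bandwidth / 1e9:.1f} GB/s")
```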

  12. Performance Results

  13. Recent Advances • Willamette (2000) → Prescott (2004) • L1 data cache: 8KB → 16KB • L2 cache: 256KB → 1MB • Pipeline stages: 20 → 31 • Frequency: 1.5GHz → 3.4GHz • Technology: 0.18 µm → 0.09 µm

  14. Research → Real Processors • Palacharla (clustering), optimal pipeline depths, trace cache, stream buffers, SMT, voltage scaling • Questions: branch predictor, clustered organization, memory dependences, power optimizations

  15. UltraSPARC IV • CMP with 2 UltraSPARC IIIs: speedups of 1.6 and 1.14 for swim and lucas (static parallelization) • UltraSPARC III: 4-wide, 16 queue entries, 14 pipeline stages • 4KB branch predictor: 95% accuracy, 7-cycle penalty • 2KB prefetch buffer between L1 and L2

  16. Alpha 21364 • Tournament predictor (local and global; 36Kb) • Issue queue (20-Int, 15-FP), 4-wide Int, 2-wide FP • Two clusters, each with 2 FUs and a copy of the 80-entry register file

  17. Next Class’ Paper • “Value Prediction”,
