
PowerPC 604 Superscalar Microprocessor

mdugan

Presentation Transcript


  1. PowerPC 604 Superscalar Microprocessor • IBM, Motorola, Apple

  2. PPC604e Overview • Member of the RISC PowerPC family • PowerPC architecture: • 32-bit effective (logical) addresses • 8-, 16-, and 32-bit integer data types, and 32- and 64-bit floating-point data types (single- and double-precision, respectively) • Superscalar processor: can issue (dispatch) up to four instructions per clock cycle • Up to seven instructions can execute in parallel

  3. Overview: 604e has 7 units • The 604e has seven independent execution units that operate in parallel: • Floating-point unit (FPU) • Branch processing unit (BPU) • Condition register unit (CRU) • Load/store unit (LSU) • Three integer units (IUs): two single-cycle integer units (SCIUs) and one multiple-cycle integer unit (MCIU)

  4. Three-stage pipelined floating-point unit (FPU) • Fully IEEE 754 compliant FPU • Supports non-IEEE mode for time-critical operations • Fully pipelined, single-pass double-precision design • Two-entry reservation station to minimize stalls • Thirty-two 64-bit FPRs for single- or double-precision operands

  5. BPU & CRU • Branch processing unit (BPU) with dynamic branch prediction • Two-entry reservation station • Out-of-order execution through two branches • 64-entry fully associative branch target address cache (BTAC) and 512-entry branch history table (BHT) • Two prediction bits per BHT entry • Condition register unit (CRU) • Two-entry reservation station

  6. Condition resolution takes time

  7. Solution: Branch speculation

  8. Branch History Table (BHT): Table of Predictors • Each branch is given a predictor • The BHT is a table of predictors • Indexed by the PC address of the branch • A predictor could be 1 bit or more; most schemes use at least 2-bit predictors • Performance = ƒ(accuracy, cost of misprediction) • Misprediction → flush the reorder buffer • In the fetch stage of a branch: use the predictor to make a prediction • When the branch completes: update the corresponding predictor • [Figure: branch PC indexing a table of predictors, Predictor 0 through Predictor 7]
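
The 2-bit predictor scheme described on this slide can be sketched as a table of saturating counters. This is only an illustration: the class and function names are hypothetical, the 512-entry default mirrors the BHT size quoted on slide 5, and the indexing function and initial counter state are assumptions, not the 604e's documented behavior.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by branch PC (illustrative)."""

    def __init__(self, entries=512):
        self.entries = entries
        # States 0-1 predict not taken, states 2-3 predict taken.
        # Starting every counter at 1 (weakly not taken) is an arbitrary choice.
        self.counters = [1] * entries

    def _index(self, pc):
        # Instructions are 4 bytes wide, so drop the two low bits (assumed hash).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Predict, then update, for each outcome; return the number of wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            wrong += 1
        bht.update(pc, taken)
    return wrong
```

With a loop branch that is taken nine times and then falls through, a 2-bit counter mispredicts only the loop exit once it has warmed up, whereas a 1-bit predictor would also mispredict the first iteration of the next pass.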

  9. BTB: Branch Address at Same Time as Prediction • Branch Target Buffer (BTB): the address of the branch is used as the index to get the prediction AND the branch address (if taken) • The PC of the instruction in FETCH is compared (=?) against the stored branch PCs • Yes: the instruction is a branch; use the predicted PC as the next PC • No: branch not predicted; proceed normally (next PC = PC+4) • Only predicted-taken branches and jumps are held in the BTB • The next PC is determined before the branch is fetched and decoded • Later: check the prediction; if wrong, kill the instruction and update the BPb • [Figure: BTB entries holding branch PC, predicted PC, and prediction state bits]
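
A minimal sketch of the lookup-and-correct flow on this slide follows. It assumes an unbounded, fully associative buffer modeled as a dictionary, with no prediction state bits; `BranchTargetBuffer`, `next_pc`, and `resolve` are hypothetical names chosen for illustration.

```python
class BranchTargetBuffer:
    """Holds predicted-taken branches only: branch PC -> predicted target PC."""

    def __init__(self):
        self.targets = {}

    def next_pc(self, pc):
        # Hit: the instruction is treated as a taken branch, so use the stored target.
        # Miss: no prediction, so proceed to the next sequential instruction.
        return self.targets.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Check the earlier prediction once the branch outcome is known.

        Returns True if the fetch-time prediction was correct, then updates
        the buffer so that only taken branches remain stored.
        """
        predicted = self.next_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)
        return predicted == actual
```

A `False` return from `resolve` corresponds to the "if wrong, kill the instruction" step: the instructions fetched from the wrongly predicted path would have to be discarded.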

  10. PPC604 Pipeline

  11. PowerPC604 Pipeline overview • Instruction fetch (IF): loads the decode queue (DEQ) with instructions from the I-cache and determines the next instruction address • Instruction decode (ID): time-critical decoding of instructions in the dispatch queue (DISQ) • Instruction dispatch (DS): • dispatches up to four instructions per cycle, in order • one per functional unit • performs non-time-critical instruction decoding • determines when an instruction can be dispatched to the execution units • At the end of DS, instructions and their operands are latched into the execution input latches or into the unit's reservation station • Rename registers and reorder buffer entries are allocated

  12. Execute (E), Complete (C), Writeback (W) • Execute (E): • instruction flow is split among six execution units (seven on the 604e, which adds the CRU); instructions enter execute from dispatch or from a reservation station • results are written into a rename buffer entry, and the complete stage is notified • Complete (C): • ensures the correct machine state is maintained; monitors instructions in the complete and execute stages • instructions are removed from the reorder buffer (ROB) when they complete • results are written back from the rename buffers to the registers at complete or at writeback • Writeback (W): writes back results from the rename buffers that were not written back during complete

  13. 604 Block Diagram – Internal Data paths

  14. Reservation Stations & Result Buses

  15. Execution Latencies

  16. PPC604e Unit Pipeline Stages

  17. Example 1: Instruction timing for Cache HIT

  18. Example 1: Instruction Timing for cache Hit

  19. BTB: Branch Address at Same Time as Prediction • Branch Target Buffer (BTB): the address of the branch is used as the index to get the prediction AND the branch address (if taken) • The PC of the instruction in FETCH is compared (=?) against the stored branch PCs • Yes: the instruction is a branch; use the predicted PC as the next PC • No: branch not predicted; proceed normally (next PC = PC+4) • Only predicted-taken branches and jumps are held in the BTB • The next PC is determined before the branch is fetched and decoded • Later: check the prediction; if wrong, kill the instruction and update the BPb • [Figure: BTB entries holding branch PC, predicted PC, and prediction state bits]

  20. Example 2: Branch Taken with BTAC hit • No branch penalty; instruction 4 (OR) is from the target stream • Cycle 2: instructions 4-7 are fetched from the target, based on the address from the BTAC hit • Cycle 5: instructions 2-3 wait for the LD to retire (WB) and retire with it

  21. Example 2: Branch taken with BTAC HIT • No penalty

  22. Example 3: Branch taken, BTAC HIT, Icache MISS

  23. Ex 4: Branch taken, BTAC Miss, correct at Decode stage • One clock penalty, to fetch the target group (2, 3, 4, 5) • Correction at Decode includes branches on CR (flags) and LR

  24. Ex 5: Branch taken, BTAC Miss, correct at Dispatch stage - 2 clock branch penalty

  25. Example 6: Branch taken, BTAC Miss, correct at Execute - 3 clock branch penalty
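
Examples 2 and 4 through 6 give a penalty of 0, 1, 2, or 3 clocks depending on where a taken branch is resolved. The sketch below combines those figures into an average penalty per taken branch. The penalty values come from the slides; the function name and the mix of fractions in the usage note are hypothetical and not measured data.

```python
# Branch penalty in clock cycles, keyed by where the taken branch is resolved
# (values taken from Examples 2, 4, 5 and 6).
PENALTY_CYCLES = {
    "btac_hit": 0,   # Example 2: target fetched with no bubble
    "decode": 1,     # Example 4: corrected at decode
    "dispatch": 2,   # Example 5: corrected at dispatch
    "execute": 3,    # Example 6: corrected at execute
}


def average_branch_penalty(fractions):
    """Weighted average penalty; fractions maps each case to its share of taken branches."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(PENALTY_CYCLES[case] * share for case, share in fractions.items())
```

For instance, an invented mix of 50% BTAC hits, 25% decode corrections, and 12.5% each for dispatch and execute corrections would be passed as `{"btac_hit": 0.5, "decode": 0.25, "dispatch": 0.125, "execute": 0.125}`.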

  26. Class Example – real dependencies • 1: ADD R1, R2, R3 ; R1 = R2 + R3 • 2: ADD R2, R1, R4 • 3: OR R3, R1, R4 • 4: SUB R3, R2, R3 • 5: FMUL F7, F5, F6 • 6: FSUB F8, F10, F7 • 7: AND R4, R1, R3
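
The real (read-after-write) dependencies in this class example can be found mechanically. The sketch below assumes the destination-first operand order implied by the comment on instruction 1 (R1 = R2 + R3); `PROGRAM` and `true_dependencies` are names introduced here for illustration.

```python
# Each entry: (mnemonic, destination register, source registers), in program order.
PROGRAM = [
    ("ADD", "R1", ("R2", "R3")),
    ("ADD", "R2", ("R1", "R4")),
    ("OR", "R3", ("R1", "R4")),
    ("SUB", "R3", ("R2", "R3")),
    ("FMUL", "F7", ("F5", "F6")),
    ("FSUB", "F8", ("F10", "F7")),
    ("AND", "R4", ("R1", "R3")),
]


def true_dependencies(program):
    """Return (producer, consumer, register) triples for read-after-write hazards.

    Instructions are numbered from 1. A source register depends on the most
    recent earlier instruction that wrote it, if there is one.
    """
    last_writer = {}
    deps = []
    for number, (_mnemonic, dest, sources) in enumerate(program, start=1):
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], number, src))
        # Record the write only after reading, so an instruction such as
        # SUB R3, R2, R3 depends on the previous writer of R3, not on itself.
        last_writer[dest] = number
    return deps


for producer, consumer, register in true_dependencies(PROGRAM):
    print(f"instruction {consumer} needs {register} from instruction {producer}")
```

Instruction 7 reads R3 from instruction 4 rather than instruction 3, because instruction 4 is the latest writer of R3. The repeated writes to R3 (instructions 3 and 4) and the late write to R4 (instruction 7) are name dependencies, not real ones.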

  27. PPC604 Pipeline

  28. Pipeline Details: Fetch Stage • Fetches instructions from I cache and loads decode queue (DEQ) • Determines address of next instruction to be fetched. • Keeps queue supplied with instructions for dispatch • Instructions fetched from I cache in groups of four, from a cache block • If only two instructions remain in the cache block, only two instructions are fetched.
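
The rule that a fetch group never crosses a cache block boundary can be sketched as follows. The 32-byte block size (eight 4-byte instructions) is an assumption made for this example, and `fetch_group_size` is a hypothetical helper; any alignment restrictions of the real fetch hardware are ignored.

```python
INSTRUCTION_BYTES = 4
BLOCK_BYTES = 32   # assumed cache block size: eight instructions
MAX_FETCH = 4      # instructions fetched per cycle, per the slide


def fetch_group_size(address):
    """Number of instructions fetched starting at address, without crossing the block."""
    offset = address % BLOCK_BYTES
    remaining = (BLOCK_BYTES - offset) // INSTRUCTION_BYTES
    return min(MAX_FETCH, remaining)
```

Under these assumptions, a fetch starting six instructions into a block (offset 0x18) returns only the two instructions that remain in that block.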

  29. Next instruction fetch address • Each stage offers a candidate address to be fetched; the latest stage has the highest priority • As a block is prefetched, the branch target address cache (BTAC) and branch history table (BHT) are searched with the fetch address • If the address is in the BTAC, the next instruction is fetched from that address • DECODE may indicate, based on the BHT or on decoding an unconditional branch, that the earlier BTAC prediction was incorrect • The BPU can indicate that a previous branch prediction, from the BTAC or DECODE, was incorrect
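
The "latest stage has highest priority" rule amounts to a priority selection over candidate addresses. The sketch below is illustrative only: the source names and their exact ordering are assumptions based on this slide and on Examples 4 through 6, not a description of the actual fetch logic.

```python
# Candidate sources ordered from highest to lowest priority (latest stage first).
SOURCE_PRIORITY = ["bpu", "dispatch", "decode", "btac", "sequential"]


def select_next_fetch_address(candidates):
    """Pick the candidate from the latest pipeline stage that offers one.

    candidates maps a source name to an address, or to None when that
    source has nothing to offer this cycle.
    """
    for source in SOURCE_PRIORITY:
        address = candidates.get(source)
        if address is not None:
            return source, address
    raise ValueError("no candidate fetch address")
```

A correction from the BPU therefore overrides a BTAC hit in the same cycle, which matches the slide's statement that the BPU can overturn an earlier prediction.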

  30. Decode Stage • Handles time-critical decoding of instructions in instruction buffer. • Contains four-instruction buffer (DEQ); shifts one or two pairs of instructions into dispatch buffer as space becomes available. • Branch correction predicts branches whose target is taken from the CTR or LR. Occurs if no CTR or LR updates are pending.

  31. Dispatch Stage • Non-time-critical decoding of instructions supplied by decode • Determines which instructions can be dispatched • Source operands are read from the register file and dispatched to the execution units • Dispatched instructions and their operands are latched into reservation stations or execution unit input latches • Each dispatched instruction is assigned a position in the 16-entry completion buffer • A rename buffer is allocated to the instruction if needed

  32. Execute Stage • Instructions are passed to the appropriate execution unit after fetch, decode, and dispatch; the execution units have different latencies • The floating-point unit is a fully pipelined, three-stage execution unit • Execution units write results into the appropriate rename buffer and notify the complete stage

  33. Branch Mispredict / Exceptions? • What if a branch instruction was mispredicted in an earlier stage? • Instructions from the mispredicted path are flushed • Fetching resumes at the correct address • If an instruction causes an exception, the execution unit reports the exception to the complete stage and continues executing instructions

  34. Complete Stage • Maintains the correct architectural machine state • As instructions finish execution, their status is recorded in their completion buffer (FIFO) entry • Entries are examined in the order in which the instructions were dispatched • Retains program order; ensures instructions complete in order • Four entries are examined each cycle for writeback • The completion buffer is used to ensure a precise exception model
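
In-order completion on top of out-of-order execution can be sketched with a FIFO. The 16-entry capacity and the four-entries-per-cycle limit come from the slides; the `CompletionBuffer` class and its method names are hypothetical, and exceptions and rename buffers are left out.

```python
from collections import deque


class CompletionBuffer:
    """FIFO of dispatched instructions that retires them strictly in program order."""

    def __init__(self, capacity=16, retire_width=4):
        self.capacity = capacity
        self.retire_width = retire_width
        self.entries = deque()  # each entry is [tag, finished], oldest first

    def dispatch(self, tag):
        """Allocate an entry at dispatch; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append([tag, False])
        return True

    def finish(self, tag):
        """Record that an execution unit has finished the instruction."""
        for entry in self.entries:
            if entry[0] == tag:
                entry[1] = True
                return
        raise KeyError(tag)

    def retire(self):
        """Retire up to retire_width finished instructions from the head, in order."""
        retired = []
        while (self.entries and self.entries[0][1]
               and len(retired) < self.retire_width):
            retired.append(self.entries.popleft()[0])
        return retired
```

An instruction that finishes early still waits behind an older unfinished one, which is what keeps the exception model precise: nothing younger than a faulting instruction has updated architectural state.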

  35. Write-Back Stage • Writes back results from the rename buffers that were not written back by the complete stage • Each rename buffer has two read ports for write-back, corresponding to the two write-back ports provided for the GPRs, FPRs, and CR • Up to two results can be copied from the write-back buffers to registers per clock cycle
