Effective ahead pipelining of instruction block address generation André Seznec and Antony Fraboulet IRISA/INRIA
Instruction fetch on wide-issue superscalar processors • Fetching 6-10 instructions in parallel on each cycle: • Fetch can be pipelined • I-cache can be banked • Instruction streams are “relatively long” Pipelining and banking are not the real issue: next block address generation is critical
Instruction block address generation • One block per cycle • Speculative: accuracy is critical • Accuracy comes with hardware complexity: • Conditional branch predictor • Sequential block address computation • Return address stack read • Jump prediction • Branch target prediction/computation • Final address selection
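The final address selection listed above can be pictured as a late multiplexor over candidate addresses that were computed or predicted in parallel. A minimal sketch, with invented function and parameter names and a simplified decode classification (not the paper's logic):

```python
# Toy sketch of final next-block address selection (illustrative only).
# All candidate addresses arrive in parallel; the predicted kind of
# block-terminating instruction selects one of them.

def select_next_block(decode_kind, taken, fallthrough, branch_target,
                      jump_target, ras_top):
    """Pick the next fetch block address among precomputed candidates."""
    if decode_kind == "cond":    # conditional branch ends the block
        return branch_target if taken else fallthrough
    if decode_kind == "jump":    # indirect jump: use the jump predictor
        return jump_target
    if decode_kind == "return":  # return: use the return address stack top
        return ras_top
    return fallthrough           # no control-transfer instruction

# Example: a predicted-taken conditional branch selects its target.
print(hex(select_next_block("cond", True, 0x1040, 0x2000, 0x3000, 0x4000)))
```

The point of the slide is that this mux is cheap; the hard part is producing accurate candidates and the decode classification early enough.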
Using a complex instruction address generator (IAG) • PentiumPro, Alpha EV6, Alpha EV8: • A fast, (relatively) inaccurate IAG responding in a single cycle, backed by a complex multicycle IAG • Loss of a significant part of the instruction bandwidth • Overfetching implemented on Alpha EV8 • Trend toward deeper pipelines: • Smaller line predictor: less accurate • Deeper pipelined IAG: longer misfetch penalty • 10% misfetches, 3-cycle penalty: 30% bandwidth loss
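The quoted 30% figure is just the slide's back-of-the-envelope product of misfetch rate and penalty; a one-line sketch of that arithmetic:

```python
# The slide's estimate: each misfetch wastes `penalty_cycles` of fetch,
# so the fraction of fetch bandwidth lost is roughly rate * penalty.

def bandwidth_loss(misfetch_rate, penalty_cycles):
    return misfetch_rate * penalty_cycles

loss = bandwidth_loss(0.10, 3)   # 10% misfetches, 3-cycle penalty
print(f"{loss:.0%}")             # roughly 30% of fetch cycles wasted
```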
Ahead pipelining the IAG • Suggested with multiple block ahead prediction in Seznec, Jourdan, Sainrat and Michaud (ASPLOS 96): • Conventional IAG: • use information available with block A to predict block B • Multiple block ahead prediction: • use information available with A to predict block C (or D) • This paper: • How to really do it
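A toy illustration of the indexing difference (addresses and tables invented; not the paper's implementation): conventional and N-block-ahead predictors are both plain table lookups, but the ahead predictor is trained and read with information that is N blocks old, so its read can start N cycles earlier.

```python
# Toy contrast: conventional vs 2-block-ahead next-block prediction.
# Both are lookup tables; what differs is which (older) block address
# is used as the index, i.e. how early the table read can start.

def predict(table, index_block):
    return table.get(index_block)

# Observed block sequence: A -> B -> C.
A, B, C = 0x100, 0x200, 0x300
conventional = {A: B, B: C}   # info available with A predicts B
two_ahead = {A: C}            # info available with A predicts C
print(hex(predict(conventional, A)))  # block B
print(hex(predict(two_ahead, A)))     # block C
```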
Fetch blocks for us [figure: fetch block termination policies] • Rc: at most B instructions per fetch block • Rs: the fetch block ends with the cache block • Rl: the fetch block ends with the second cache block
Fetch blocks (2) [figure: not-taken branch bypassing policies] • 0nt: no not-taken branch bypassed • 1nt: one not-taken branch bypassed • aNT: all not-taken branches bypassed
Fetch blocks (3) [figure: block-ending policies] • 0NT: no extra control-transfer instruction • 0NT+: no extra conditional branch
Technological hypothesis • Alpha EV8 front-end + twice faster clock: • Cycle = one phase of an EV8 cycle • Cycle = time to cross an 8-to-10-entry multiplexor • Cycle = time to read a 2Kb table and route the read data back as an index • 2 cycles to read a 16Kb table and route the read data back as an index
Hierarchical IAG • Complex IAG + line predictor • The conventional complex IAG spans four cycles: • 3 cycles for conditional branch prediction • 3 cycles for I-cache read and branch target computation • Jump prediction, return stack read • + 1 cycle for final address selection • Line prediction: • a single 2Kb table + a 1-bit direction table • select between fallthrough and the line predictor read
Hierarchical IAG (2) [figure: hierarchical IAG datapath — conditional branch predictor, jump predictor, RAS and line predictor (LP) check feeding final selection; branch target addresses + decode info from the I-cache]
Ahead pipelining the IAG • Same functionalities required: • Final selection: • uses the last cycle in the address generation • Conditional branch predictor • Jump predictor • Return address stack • Branch target prediction: • Recomputation is impossible! • Use of a Branch Target Buffer (BTB) • Decode information: MUST BE PREDICTED
Ahead pipelining the IAG (2) • Initiate table reads N cycles ahead with the information then available: • N-block-ahead address • N-block-ahead branch history • Problem: significant loss of accuracy Solution: use the N-block-ahead (address, history) + the intermediate path!
Inflight path incorporation • Pipelined table access + final logic: • Column selection • Wordline selection • Final logic • Insert one bit of inflight information per intermediate block • Not the same bit for each table in the predictor!
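A sketch of the idea (hash function, table size, and bit choices are invented): the read is started with the N-block-ahead index, and one bit of information from each intermediate block is folded in at successive access stages, with a different bit per table so the tables are de-aliased differently.

```python
# Sketch: start a table read with N-block-ahead information, then fold
# in one bit per intermediate block at successive pipeline stages
# (column select, wordline select, final logic). Each table uses a
# different bit of the intermediate block addresses.

TABLE_BITS = 10  # 1K-entry toy table

def ahead_index(ahead_addr, ahead_hist, inflight_blocks, table_id):
    # Base index from the N-block-ahead address and branch history.
    idx = (ahead_addr ^ ahead_hist) & ((1 << TABLE_BITS) - 1)
    # One bit per intermediate block; which bit of the intermediate
    # address is used depends on the table.
    for stage, block_addr in enumerate(inflight_blocks):
        bit = (block_addr >> table_id) & 1
        idx ^= bit << stage
    return idx

# Two tables read with the same ahead info but different inflight bits:
inflight = [0x105, 0x2a2, 0x33f]   # three intermediate block addresses
print(ahead_index(0x4000, 0x1b7, inflight, table_id=0))
print(ahead_index(0x4000, 0x1b7, inflight, table_id=1))
```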
Ahead conditional branch predictor • Global-history branch predictor • 512 Kbits • 2bc-gskew, 4-cycle prediction delay: • Indexed using the 5-block-ahead (address + history) • One bit per table per intermediate block Accuracy equivalent to conventional conditional branch predictors
Ahead Branch Target Prediction • Use of a BTB: • Tradeoffs: • Size: 2Kb (1 cycle) vs 16Kb (2 cycles) • Tagless or tagged (+1 cycle) • Associativity: +1 cycle? • Difficulty: • The longer the pipelined read, the larger the number of possible paths, especially if not-taken branches are bypassed
Ahead Branch target prediction (2) • 2Kb is too small: • 16Kb, 2-cycle access time • Associativity is required, but an extra cycle for the tag check is disastrous: • tagless + way prediction! • 2-way skewed associativity: • to incorporate different inflight information on the two ways
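A toy sketch of this organization (hash functions, sizes, and update policy invented): each way is indexed with a different hash, which is what lets different inflight information be folded into each way, and a small way predictor picks one way so no tag comparison sits on the read path.

```python
# Toy tagless 2-way skewed-associative BTB with way prediction.
# Each way uses a different index hash (skewing); a way predictor
# chooses the way, avoiding a slow tag check on the read path.

ENTRIES = 1 << 8

def h0(addr):                      # hash for way 0
    return addr % ENTRIES

def h1(addr):                      # different hash for way 1 (skewing)
    return ((addr >> 4) ^ addr) % ENTRIES

way0 = [0] * ENTRIES
way1 = [0] * ENTRIES
way_pred = [0] * ENTRIES           # which way to believe for this block

def btb_predict(addr):
    w = way_pred[h0(addr)]
    return way1[h1(addr)] if w else way0[h0(addr)]

def btb_update(addr, target, way):
    if way == 0:
        way0[h0(addr)] = target
    else:
        way1[h1(addr)] = target
    way_pred[h0(addr)] = way

btb_update(0x1234, 0xbeef, way=1)
print(hex(btb_predict(0x1234)))    # 0xbeef
```

Being tagless, a mispredicted way or an aliased entry simply yields a wrong target, which is caught later as a misfetch; that trade is what buys back the tag-check cycle.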
Ahead Branch Target Prediction (3) • Bypassing not-taken branches: • N possible branches: N possible targets • N targets per entry: waste of space • Many entries need only a single target • Instead: read N contiguous entries
Ahead jump predictor, return stack • Jump predictor: history + address • 3 cycles ahead + inflight information • Return address stack as usual: • Direct access to the previous cycle's top • Access to the possible address to be pushed
Decode information!! • Needs the decode information of the current block to select among: • Fallthrough • Branch targets • Jump target • Return address stack top Cannot be obtained from the I-cache in time: it must be predicted!
Ahead pipelined IAG and decode [figure: ahead pipelined IAG datapath — conditional predictor, jump predictor, BTB, RAS and predicted decode information feeding the final selection stage; where does the decode info come from?]
Decode information (2) • 1st try (not in the paper): store it in the BTB • Entries without targets (more capacity misses) • Duplicated decode information if multiple targets • Decode mispredictions even when the target is correctly predicted by the jump predictor, RAS, or fallthrough Not very convincing results
Predicting decode information Principle: the decode information associated with a block is stored with the block address itself whenever possible: • BTB, jump predictor • But not for returns and fallthrough blocks
Predicting decode information (2) • Return block decode prediction: • Return decode prediction table indexed with the (one-cycle-ahead) top of the return stack • Systematic decode misprediction when chaining a call and its associated return in two successive blocks. Sorry!
Fallthrough block decode prediction • Just after a taken control-flow instruction, there is no time to read an extra table before the final selection stage: • Store the decode information for block A+1 with address A in the BTB and jump predictor • Fallthrough after fallthrough: • 2-block-ahead decode prediction table: • indexed with A to read the decode information for A+2
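A sketch of the fallthrough-after-fallthrough case (table size, block size, and indexing invented): the decode information for block A+2 is stored under block A's address, so the read starts two cycles before A+2 is selected.

```python
# Toy 2-block-ahead decode prediction table: decode info for fetch block
# A+2 is stored under the address of block A, so the table read can
# start while blocks A+1 and A+2 are still in flight.

ENTRIES = 128
BLOCK = 32                          # assumed fetch block size in bytes
table = [None] * ENTRIES

def idx(addr):
    return (addr // BLOCK) % ENTRIES

def train(addr_a, decode_info_a_plus_2):
    table[idx(addr_a)] = decode_info_a_plus_2

def predict(addr_a):
    return table[idx(addr_a)]       # decode info expected for block A+2

# Block 0x1000 falls through twice; block 0x1040 ends with a return.
train(0x1000, "return")
print(predict(0x1000))              # decode info for block 0x1040
```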
Hierarchical IAG vs ahead pipelined IAG [figure: pipeline timing, cycles -4 to -1 — hierarchical: LP read, IAG initiation, LP check, final selection; ahead pipelined: BTB / jump predictor / return decode / 2-block-ahead decode / conditional branch reads, then decode prediction check and final selection]
Recovering after a misprediction • Generation of the address of the next block after recovery should have begun 4 cycles before recovery: • 4 cycles of extra misprediction penalty!? • Unacceptable • The checkpoint/repair mechanism should provide the information for a smooth restart
Recovering after a misprediction (2): predicting the next block • There is no time to read any of the tables in the IAG, only time to cross the final selection stage. • The checkpoint must provide everything that was at the entry of the final stage one cycle ahead. Less than 200 bits for all policies, except bypassing all not-taken branches
Recovering from misprediction (3): 3rd block and subsequent • 3rd block: • The BTB and jump predictor cannot provide targets in time: all possible targets must come from the checkpoint • Conditional branch predictions are not available • 4th and 5th blocks: • Conditional branch predictions are not available • (Approximately) 4 possible sets of entries for the final selection stage in the checkpoint repair, but 2-cycle access time
Recovering from misprediction (4): one- or two-bubble restart • If full-speed I-fetch resumption is too costly in the checkpoint mechanism: • Two-bubble restart: all possible targets recomputed/predicted; only the conditional predictions for the next and third blocks need to be checkpointed • One-bubble restart: only one set of exits from the BTB and jump predictor + conditional branch predictions need to be checkpointed
Performance evaluation • SPEC 2000 int + float • Traces of 100 million instructions after skipping the initialization • Immediate update
Average fetch block size (integer applications) • Instruction streams: 9.79 instructions • 2 * 8 inst/cache line • No bypassing of not-taken branches: 5.81 instructions • Bypassing one not-taken branch: 7.28 instructions • Bypassing all not-taken branches: 7.40 instructions Bypassing a single not-taken branch is sufficient
Accuracy of IAGs (integer benchmarks) [figure: mispredictions per kilo-instruction (misp/KI)] Very similar accuracies
Misfetches: IAG faults corrected at decode time [figure: misfetches/KI, integer applications]
Misfetches (2) [figure: misfetches/KI, floating-point applications]
(Maximum) instruction fetch bandwidth [figure: integer applications]
Summary • Ahead pipelining the IAG is a valid alternative to the use of a hierarchy of IAGs: • Accuracy in the same range • Significantly higher instruction bandwidth • Main contributions: • Decode prediction • Checkpoint/repair analysis
Future work • Sufficient to feed 4-to-6-way processors • May be a little short for 8-way or wider processors • We plan to extend the study to: • Decoupled instruction fetch front ends (Reinman et al.) • Multiple (non-contiguous) block address generation • Trace caches