Instruction Prefetching in SMT (Simultaneous Multithreading) Systems and Its Impact on Performance by Choi, Jun-Shik and Park, Joo Hyung
Contents • Purpose • Background • Theory • Simulation • Results • Conclusion
1. Purpose • To speed up execution • SMT has been considered as a way to exploit both ILP (instruction-level parallelism) and TLP (thread-level parallelism). • Prefetching has been used to reduce the cache miss penalty and to use memory bandwidth efficiently.
2. Background • Traditional Processor • Out-of-order Execution • Cache Prefetching
Traditional Processor • A traditional processor stalls during the memory latency, from the time a data miss occurs until the data arrives. [Timeline: L1 miss → memory latency (stall) → data arrival]
Out-of-order Execution • Because data and control dependencies must be observed, the processor will still stall at some point if the memory latency is long. [Timeline: L1 miss → independent instructions execute during the memory latency → stall → data arrival → dependent instructions]
Cache Prefetching • Cache prefetching overcomes this restriction by bringing data into the L1 cache or an on-chip buffer ahead of time, hiding as much of the cache miss penalty as possible. [Timeline: prefetch issued → data arrives by the time of the L1 miss → dependent instructions proceed]
3. Theory • Simultaneous Multithreading (SMT) • Plenty of resources • Instruction-level parallelism • Thread-level parallelism • Markov prefetcher
Prefetch Methods • Stride prefetcher: targets memory references separated by a constant stride (see the sketch below) • Recursive prefetcher: designed for linked data structures, where the pointer chain forms the pattern • Markov prefetcher: based on the miss-address sequence • Etc.
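A minimal sketch of the stride method above, in C; this is illustrative, not code from the ss_smt simulator. It tracks the distance between successive addresses issued by one load instruction and prefetches the next expected address once the same stride has been seen twice in a row. The names stride_entry and issue_prefetch are assumptions made for the example.

#include <stdint.h>
#include <stdio.h>

/* One stride-detector entry, conceptually indexed by the PC of a load. */
typedef struct {
    uint64_t last_addr;  /* last address referenced by this load */
    int64_t  stride;     /* last observed stride                 */
} stride_entry;

/* Hypothetical hook: a real simulator would enqueue a cache fill here. */
static void issue_prefetch(uint64_t addr) {
    printf("prefetch 0x%llx\n", (unsigned long long)addr);
}

/* Called on every memory reference made by the tracked load. */
static void stride_observe(stride_entry *e, uint64_t addr) {
    int64_t s = (int64_t)(addr - e->last_addr);
    if (s != 0 && s == e->stride)       /* same stride twice in a row */
        issue_prefetch(addr + (uint64_t)s);
    e->stride = s;
    e->last_addr = addr;
}

int main(void) {
    stride_entry e = {0, 0};
    /* A 64-byte array walk: prefetches fire from the third access on. */
    for (uint64_t a = 0x1000; a < 0x1200; a += 64)
        stride_observe(&e, a);
    return 0;
}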
Basic Markov <Example> • Miss-address history: 1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3 • Transitions learned from this history: 1 → 2 (60%), 1 → 3 (20%), 1 → 1 (20%), 2 → 3 (100%) [Diagram: Markov transition graph over the addresses, with edges labeled by these probabilities]
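The percentages above can be recovered mechanically by counting successor frequencies in the history. The following self-contained C sketch does exactly that; the 1..6 address range is the example's, not a property of the prefetcher.

#include <stdio.h>

#define MAXADDR 7   /* the example uses addresses 1..6 */

int main(void) {
    /* Miss-address history from the example above. */
    int hist[] = {1,2,3,4,3,5,1,3,6,6,5,1,1,2,3,4,5,1,2,3,4,3};
    int n = (int)(sizeof hist / sizeof hist[0]);
    int count[MAXADDR][MAXADDR] = {{0}};
    int total[MAXADDR] = {0};

    /* Count every observed transition hist[i] -> hist[i+1]. */
    for (int i = 0; i + 1 < n; i++) {
        count[hist[i]][hist[i + 1]]++;
        total[hist[i]]++;
    }

    /* Print transition probabilities, e.g. "1 -> 2 (60%)". */
    for (int from = 1; from < MAXADDR; from++)
        for (int to = 1; to < MAXADDR; to++)
            if (count[from][to])
                printf("%d -> %d (%d%%)\n", from, to,
                       100 * count[from][to] / total[from]);
    return 0;
}

Running this reproduces the figures in the example: 1 is followed by 2 in 60% of the cases, by 1 and 3 in 20% each, and 2 is always followed by 3.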
The Address Sequence in Prefetch • The miss-address (IL1-cache miss) stream is used as the prediction source • The full stream of CPU demand references has too wide a bandwidth to track directly • Filtering through the L1 cache makes the miss-address sequence far less frequent, and therefore manageable
Problem in Realizing Pure Markov Prediction • Programs reference millions of addresses, so it is impossible to record every reference in a single table
Prefetch Table

State (1-history) | 1st prediction | 2nd prediction | 3rd prediction | 4th prediction
1                 | 2              | 1              | 3              | -
2                 | 3              | -              | -              | -
3                 | 4              | 5              | -              | -
4                 | 3              | 5              | -              | -
5                 | 1              | -              | -              | -
6                 | 5              | 6              | -              | -
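One plausible C representation of a table entry, assuming (as the table does) a 1-address history and up to four predictions, with a most-recently-used replacement policy. This is a sketch under those assumptions, not the actual hardware organization; entry_update is an illustrative name.

#include <stdint.h>
#include <stdio.h>

#define NPRED 4   /* predictions kept per state, as in the table */

typedef struct {
    uint64_t state;        /* the 1-history: the previous miss address */
    uint64_t pred[NPRED];  /* predicted successors, most recent first  */
    int      npred;        /* number of valid predictions              */
} pf_entry;

/* Record that miss `next` followed miss `state`: move it to the front
   of the prediction list; if the list is full, drop the oldest entry. */
static void entry_update(pf_entry *e, uint64_t next) {
    int i = 0;
    while (i < e->npred && e->pred[i] != next)
        i++;
    if (i == e->npred && e->npred < NPRED)
        e->npred++;
    if (i >= NPRED)
        i = NPRED - 1;                 /* full: overwrite the oldest */
    for (; i > 0; i--)
        e->pred[i] = e->pred[i - 1];
    e->pred[0] = next;
}

int main(void) {
    /* Replay the successors of address 1 from the example history
       (2, 3, 1, 2, 2); the result matches the table row "1: 2 1 3 -". */
    pf_entry e = { .state = 1 };
    uint64_t succ[] = {2, 3, 1, 2, 2};
    for (int i = 0; i < 5; i++)
        entry_update(&e, succ[i]);
    for (int i = 0; i < e.npred; i++)
        printf("%llu ", (unsigned long long)e.pred[i]);
    printf("\n");
    return 0;
}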
Prefetch Diagram [Diagram: the CPU's address request goes to the L1 cache and the prefetch buffer in parallel; on a miss, the miss address is passed to the prefetcher, which consults the prefetch table and prefetches predicted lines from the L2 cache / memory into the prefetch buffer]
Prefetch Algorithm • On a CPU address request that misses in L1, look up the miss address in the prefetch table • If matched (Y): prefetch the predicted data from L2 into the prefetch buffer, so that a later cache lookup finds it there and the data reaches the CPU without the full miss penalty • If not matched (N): service the miss normally, transferring the data from L2 to the CPU • In either case, update or insert the corresponding information in the prefetch table (see the sketch below)
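Putting the pieces together, here is a self-contained C sketch of this control flow. The buffer and L2 hooks are trivial stand-ins (they print instead of modeling latency), the table keeps a single prediction per state for brevity, and none of the names come from ss_smt itself.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define TSZ 256   /* assumed (small) prefetch-table capacity */

static uint64_t tbl_state[TSZ], tbl_pred[TSZ];

/* Trivial stand-ins for the memory hierarchy: a real simulator would
   model the prefetch buffer's contents and the L2/memory latencies. */
static bool prefetch_buffer_hit(uint64_t a) { (void)a; return false; }
static void fetch_from_l2(uint64_t a)       { printf("demand fill 0x%llx\n", (unsigned long long)a); }
static void prefetch_to_buffer(uint64_t a)  { printf("prefetch    0x%llx\n", (unsigned long long)a); }

/* Called once per L1 miss, following the flow above. */
static void on_l1_miss(uint64_t miss) {
    static uint64_t prev;              /* previous miss address (1-history) */
    unsigned i = (unsigned)(miss % TSZ);

    /* Service the demand miss; a prefetch-buffer hit avoids the L2 access. */
    if (!prefetch_buffer_hit(miss))
        fetch_from_l2(miss);

    /* Y-branch: the miss matches a table entry, so prefetch the
       predicted successor from L2 into the prefetch buffer. */
    if (tbl_state[i] == miss && tbl_pred[i])
        prefetch_to_buffer(tbl_pred[i]);

    /* Learning step (covers the N-branch too): update or insert the
       table entry recording that `miss` followed `prev`. */
    if (prev) {
        unsigned j = (unsigned)(prev % TSZ);
        tbl_state[j] = prev;
        tbl_pred[j]  = miss;
    }
    prev = miss;
}

int main(void) {
    /* Replay the example miss history through the prefetcher. */
    uint64_t hist[] = {1,2,3,4,3,5,1,3,6,6,5,1,1,2,3,4,5,1,2,3,4,3};
    for (unsigned k = 0; k < sizeof hist / sizeof hist[0]; k++)
        on_l1_miss(hist[k]);
    return 0;
}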
4. Simulation • Modified code: ss_smt-1.0 • Specification • Threads: 2 • Cache: L1 (64KB), L2 • Number of instructions: 100 million • Benchmarks used • MCF (integer) and ART (floating point) • GCC (integer) and MESA (floating point)
5. Results • Benchmark pairs for the two threads • MCF and ART • L1 miss rates = 0.0794, 0.0921 • Number of L1 misses = number of accesses to the prefetch buffer: 23, 7 • GCC and MESA • L1 miss rates = 0.0010, 0.0009 • Number of L1 misses = number of accesses to the prefetch buffer: 15, 13
Benchmark Reference 1 • The following benchmarks (ART, MCF) grow quickly to their target sizes (expressed in megabytes) and then stay there:

benchmark | max rsz | max vsz | num obs | num unchanged | stable?
art       | 3.7     | 4.3     | 157     | 37            | x
Benchmark Reference 2 • These benchmarks (GCC, MESA) change size over time:

benchmark | max rsz | max vsz | num obs | num unchanged | stable?
mesa      | 9.4     | 23.1    | 132     | 131           | stable
6. Conclusion • A prefetcher using the Markov algorithm has been simulated. • To be efficient in the system, the Markov prefetcher needs enough training time and enough L1 misses, because it operates on the history of the L1 miss-address sequence. • Disadvantages of the Markov prefetcher • High hardware cost; not a good stand-alone prefetcher