Signature Buffer: Bridging Performance Gap between Registers and Caches
Lu Peng, Jih-Kwon Peir, Konrad Lai
Introduction
• Two types of storage
  • Registers
    • Fast and small
    • Supply data for operations
  • Memory
    • Large and slow
    • Cache for recently used data
• Most RISC processors operate only on data held in registers
• Data communication path
  • Producer -> store -> load -> consumer
Introduction
• Future processors with 35nm technology
  • 10 GHz clock
  • 64 KB L1 cache
  • 3-7 cycles L1 cache access time
• IPC degrades by 3.5% per additional cycle of L1 cache access time
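To put the 3.5% figure in perspective, the following is a rough sketch of what moving from a 3-cycle to a 7-cycle L1 access could cost. The slide does not say whether the loss accumulates linearly or compounds per cycle, so both readings are shown; the function names are hypothetical.

```python
IPC_LOSS_PER_CYCLE = 0.035  # from the slide: 3.5% per additional L1 cycle


def relative_ipc_linear(extra_cycles: int) -> float:
    """Relative IPC if each extra cycle costs 3.5% of the baseline IPC."""
    return 1.0 - IPC_LOSS_PER_CYCLE * extra_cycles


def relative_ipc_compound(extra_cycles: int) -> float:
    """Relative IPC if each extra cycle costs 3.5% of the current IPC."""
    return (1.0 - IPC_LOSS_PER_CYCLE) ** extra_cycles


# Going from the low end (3 cycles) to the high end (7 cycles) of the range.
extra = 7 - 3
linear = relative_ipc_linear(extra)      # 1 - 4 * 0.035
compound = relative_ipc_compound(extra)  # 0.965 ** 4

print(f"linear reading:   {1 - linear:.1%} IPC loss")
print(f"compound reading: {1 - compound:.1%} IPC loss")
```

Under either reading the loss across the 3-7 cycle range is in the low teens of percent.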
Signature Buffer
• Zero-cycle load
  • “The load and its dependent instructions can be fetched, dispatched and executed at the same time”
• Avoid address calculation
  • Each load and store uses a signature for accessing the storage
  • The signature buffer can be accessed in early pipeline stages
• A signature consists of:
  • Color of the base register
  • Displacement value
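A minimal sketch of the signature idea, written as a toy software model rather than the hardware design. It assumes (this is not stated on the slide) that a base register receives a fresh color whenever it is overwritten, so a signature formed from the old value stops matching. The names `Signature`, `RegisterColors`, and `SignatureBuffer` are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Signature:
    color: int          # color currently assigned to the base register
    displacement: int   # immediate displacement from the instruction encoding


class RegisterColors:
    """Hypothetical color table: one color per architectural register."""

    def __init__(self, num_regs: int = 32):
        self._next = count(num_regs)          # source of fresh colors
        self._color = list(range(num_regs))   # initial color = register number

    def color_of(self, reg: int) -> int:
        return self._color[reg]

    def on_write(self, reg: int) -> None:
        # Assumption: overwriting a register gives it a fresh color.
        self._color[reg] = next(self._next)


class SignatureBuffer:
    """Toy buffer keyed by signature; no address calculation is involved."""

    def __init__(self):
        self._data = {}

    def store(self, sig: Signature, value: int) -> None:
        self._data[sig] = value

    def load(self, sig: Signature):
        return self._data.get(sig)  # None models an SB miss


colors = RegisterColors()
sb = SignatureBuffer()

# A store to 8(r2) followed by a load from 8(r2): same signature, so a hit.
sb.store(Signature(colors.color_of(2), 8), 1234)
hit = sb.load(Signature(colors.color_of(2), 8))

# After r2 is overwritten, 8(r2) forms a different signature and misses.
colors.on_write(2)
miss = sb.load(Signature(colors.color_of(2), 8))
```

Because the key is built only from the register color and the displacement, both of which are known at decode time, the lookup does not wait for the effective address.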
Outline
• Motivation
• Implementation
• Performance evaluation
Motivation – Memory Reference Correlations
• Signature correlations
  • Store-load and load-load pairs can be correlated directly by the signature
• Signature reference locality
  • Nearby memory references often use the same base register and differ only by a small displacement value
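Signature reference locality can be illustrated by grouping signatures into lines: references off the same base register with small displacements fall into the same line, so one fill can serve several later loads. This is a sketch only; the 32-byte line size and the function name `sb_line_and_offset` are assumptions, since the slides give neither.

```python
LINE_BYTES = 32  # hypothetical SB line size; the slides do not give one


def sb_line_and_offset(color: int, displacement: int):
    """Split a signature into an SB line tag and an offset within that line."""
    line_tag = (color, displacement // LINE_BYTES)
    offset = displacement % LINE_BYTES
    return line_tag, offset


# Three references off the same base register (color 7), small displacements.
refs = [sb_line_and_offset(7, d) for d in (0, 8, 24)]
same_line = len({tag for tag, _ in refs}) == 1

# A different base register color never shares a line tag,
# even at the same displacement.
other_tag, _ = sb_line_and_offset(9, 8)
```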
Example 1: Source and Assembly Codes of Function copy_disjunct from Parser (figure; annotations: signature correlations, signature reference locality)
Example 2: Source and Assembly Codes of Function bsW from Bzip (figure)
Signature Buffer: Initial State (figure)
Signature Buffer (figure, continued)
Data Alignment
Figure components: SB Directory, SB Data Array, L1 Data Array, L1 Tag Array
Requests (Signature): A-001 -> A-101 -> B-010 -> X-000
(Real Address): C-100 D-000 D-101 D-000
Data Alignment (continued): SB MISS!
Data Alignment (continued): SB MISS! Invalidate high A, low B
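One way to read the request sequence is that signature lines and real (L1) lines sit on grids that are offset from each other. The sketch below assumes 8-word lines (the slide uses 3-bit offsets) and a misalignment of 3 words, inferred from A-001 mapping to C-100; `real_location` and the numeric line indices are hypothetical and not part of the slides.

```python
WORDS_PER_LINE = 8  # the slide uses 3-bit offsets such as 001 and 101


def real_location(base_line: int, shift: int, sig_offset: int):
    """Map an offset within a signature line to (real line, real offset).

    base_line: index of the real line holding the start of the signature line
    shift:     assumed misalignment, in words, between the two line grids
    """
    position = shift + sig_offset
    return base_line + position // WORDS_PER_LINE, position % WORDS_PER_LINE


C, D = 0, 1   # two consecutive real lines, numbered arbitrarily
SHIFT = 3     # inferred from A-001 -> C-100 (offset 1 lands at offset 4)

a_low = real_location(C, SHIFT, 0b001)    # A-001
a_high = real_location(C, SHIFT, 0b101)   # A-101
b_low = real_location(D, SHIFT, 0b010)    # B-010
```

Under these assumptions the high part of signature line A and the low part of signature line B both land in real line D, which is consistent with the final step invalidating "high A, low B" when D is accessed by its real address.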
Microarchitecture
• Bypass I
  • SB hit or an early store-load forwarding
• Bypass II
  • Normal store-load forwarding
Performance Evaluation – IPC
• SB – nospec: 13% speedup
• SB – perfect: 14% speedup
Performance Evaluation – Load Distribution
• Normal S-L forwarding and L1 accesses are reduced to 30% of loads; the other 70% of loads benefit from the SB
• With a perfect memory dependence predictor, the SB obtains 23% zero-cycle loads
Performance Evaluation – SB Hit Ratio
• Average SB hit rate is about 51%
Performance Evaluation – Comparison with L0 Cache
• The performance benefit of the SB goes up with L1 latency and is always above that of an L0 cache
Performance Evaluation – Comparison with L0 Cache
• Larger L0 => higher hit rate
• The SB is less sensitive to size
Advantages
• Non-speculative
  • Data obtained from the SB without intervening stores is always correct
• All loads can access data from the SB, with no restriction on the type of load or base register
• Loads through the SB can bypass address generation and cache access completely
• Store/load correlation is established from the instruction encoding bits, which simplifies the hardware
• The SB uses line-based granularity to capture spatial locality
Loads – SB Specific
• Early S-L forwarding
  • A load has the same signature as an earlier store in the LSQ, with no intervening store between them (zero-cycle load & SB hit)
• Early SB access
  • The SB is accessed after a load is fetched and decoded (zero-cycle load & SB hit)
• Delayed SB access
  • The SB is accessed after memory dependence resolution because of intervening stores (SB hit)
• Non-signature forwarding
  • Consecutive SB misses to the same SB line get data forwarded from previous misses (SB miss)
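The four categories above can be sketched as a small decision function. The order of the checks and the boolean inputs are assumptions made for illustration; the slides list the categories but not how the hardware prioritizes them, and `LoadKind` and `classify_load` are hypothetical names.

```python
from enum import Enum


class LoadKind(Enum):
    EARLY_SL_FORWARDING = "early S-L forwarding"
    EARLY_SB_ACCESS = "early SB access"
    DELAYED_SB_ACCESS = "delayed SB access"
    NON_SIGNATURE_FORWARDING = "non-signature forwarding"
    OTHER = "normal S-L forwarding or L1 access"


# Per the slide, only the first two categories are zero-cycle loads.
ZERO_CYCLE = {LoadKind.EARLY_SL_FORWARDING, LoadKind.EARLY_SB_ACCESS}


def classify_load(store_match: bool, intervening_store: bool,
                  sb_hit: bool, pending_miss_same_line: bool) -> LoadKind:
    """Classify a load under an assumed priority order.

    store_match:            an earlier store in the LSQ has the same signature
    intervening_store:      an unresolved store lies between that point and the load
    sb_hit:                 the signature is present in the SB
    pending_miss_same_line: an earlier SB miss to the same SB line is in flight
    """
    if store_match and not intervening_store:
        return LoadKind.EARLY_SL_FORWARDING
    if sb_hit:
        return (LoadKind.DELAYED_SB_ACCESS if intervening_store
                else LoadKind.EARLY_SB_ACCESS)
    if pending_miss_same_line:
        return LoadKind.NON_SIGNATURE_FORWARDING
    return LoadKind.OTHER


kind = classify_load(store_match=False, intervening_store=True,
                     sb_hit=True, pending_miss_same_line=False)
print(kind.value, "| zero-cycle:", kind in ZERO_CYCLE)
```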