
Signature Buffer: Bridging Performance Gap between Registers and Caches


Presentation Transcript


  1. Signature Buffer: Bridging Performance Gap between Registers and Caches Lu Peng, Jih-Kwon Peir, Konrad Lai

  2. Introduction • Two types of storage • Registers • Fast and small • Supply data for operations • Memory • Large and slow • Cache holds recently used data • Most RISC ISAs operate only on data in registers • Data communication path: producer -> store -> load -> consumer

  3. Introduction • Future processors at 35nm technology • 10 GHz clock • 64 KB L1 cache • 3-7 cycle L1 cache access time • IPC degrades by 3.5% per additional cycle of L1 access latency

  4. Signature Buffer • Zero-cycle load • “The load and its dependent instructions can be fetched, dispatched and executed at the same time” • Avoids address calculation • Each load and store uses a signature to access the buffer • The signature buffer can be accessed in early pipeline stages • A signature consists of • The color of the base register • The displacement value
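
As a rough illustration of how such a signature might be formed (the field widths, the 32-register file and the names below are assumptions for illustration, not the paper's encoding):

    #include <stdint.h>

    /* Illustrative only: an 8-bit color per base register and a 16-bit
     * displacement packed into a small struct; the paper's actual widths
     * and encoding may differ. */
    typedef struct {
        uint8_t base_color;    /* current color of the base register   */
        int16_t displacement;  /* displacement field of the load/store */
    } signature_t;

    /* colors[r] holds the current color of architectural register r; two
     * memory instructions that use the same (unredefined) base register
     * and the same displacement produce identical signatures. */
    static signature_t make_signature(const uint8_t colors[32],
                                      unsigned base_reg, int16_t disp) {
        signature_t s = { colors[base_reg], disp };
        return s;
    }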

  5. Outline • Motivation • Implementation • Performance evaluation

  6. Motivation – Memory Reference Correlations • Signature correlations • Store-load and load-load pairs can be correlated directly by their signatures • Signature reference locality • Nearby memory references through the same base register often differ only by a small displacement value

  7. Example 1 – Signature correlations and signature reference locality (source and assembly code of function copy_disjunct from Parser)

  8. Example 2 – Source and assembly code of function bsW from Bzip
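
A generic C fragment (not the paper's copy_disjunct or bsW code) illustrating both properties; a compiler may of course keep p->next in a register, but the fragment stands in for the reload patterns in the examples above:

    struct node { int data; struct node *next; };

    /* Hypothetical fragment for illustration. */
    void relink(struct node *p, struct node *q) {
        p->next = q;               /* store [p + off(next)]                  */
        struct node *t = p->next;  /* load  [p + off(next)]: identical       */
                                   /* signature, so the store-load pair is   */
                                   /* correlated directly (signature         */
                                   /* correlation)                           */
        t->data = p->data;         /* p->data: same base register, small     */
                                   /* displacement away (signature reference */
                                   /* locality)                              */
    }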

  9. Signature Buffer

  10. Signature Buffer – Initial State (diagram of the SB and register colors in their initial state)

  11. Signature Buffer (diagram: the same example after updates to register colors and SB entries)
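
One plausible way to maintain the base-register colors behind these examples (a toy sketch; the actual update rule, color width and table organization in the paper may differ):

    #include <stdint.h>

    #define NUM_REGS 32

    static uint8_t reg_color[NUM_REGS];  /* current color of each register */
    static uint8_t next_color;           /* next unused color value        */

    /* When a base register is redefined, its old signatures no longer name
     * the same memory locations, so the register receives a fresh color and
     * later loads/stores through it form new signatures. */
    void on_register_redefined(unsigned r) {
        reg_color[r] = next_color++;     /* wraps modulo 256 in this sketch */
    }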

  12. Data Alignment

  13.–17. Data Alignment (animated diagram: SB Directory, SB Data Array, L1 Tag Array and L1 Data Array; the request sequence A-001 -> A-101 -> B-010 -> X-000 (real address) maps to L1 locations C-100, D-000, D-101 and D-000; the requests miss in the SB, and the final step invalidates the high part of line A and the low part of line B)
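
A minimal sketch of an SB directory lookup consistent with this alignment example (entry layout, line size and names are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    #define SB_LINE_WORDS 8   /* illustrative SB line size in words */

    /* Because an SB line need not align with an L1 line, validity is
     * tracked per word; "invalidate high A, low B" clears the valid bits
     * of the affected halves of those lines. */
    typedef struct {
        uint32_t sig_tag;               /* signature tag (color plus upper */
                                        /* displacement bits)              */
        bool     valid[SB_LINE_WORDS];  /* per-word valid bits             */
    } sb_dir_entry_t;

    /* A request hits only when the signature tag matches and the word
     * selected by the low displacement bits is valid; otherwise the load
     * falls back to address generation and the L1 cache. */
    bool sb_lookup(const sb_dir_entry_t *e, uint32_t sig_tag, unsigned word) {
        return e->sig_tag == sig_tag && word < SB_LINE_WORDS && e->valid[word];
    }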

  18. Microarchitecture • Bypass I • SB hit or early store-load forwarding • Bypass II • Normal store-load forwarding

  19. Microarchitecture

  20. Performance Evaluation

  21. Performance Evaluation – IPC (chart: SB-nospec achieves a 13% speedup; SB-perfect achieves a 14% speedup)

  22. Performance Evaluation – Load Distribution (chart: normal store-load forwarding plus L1 accesses are reduced to 30% of loads, so 70% of loads benefit from the SB; with a perfect memory dependence predictor the SB obtains 23% zero-cycle loads)

  23. Performance Evaluation – SB Hit Ratio (chart: the average SB hit rate is about 51%)

  24. Performance Evaluation – Comparison with L0 Cache (chart: the performance benefit of the SB grows with L1 latency and always exceeds that of adding an L0 cache)

  25. Performance Evaluation – Comparison with L0 Cache (chart: a larger L0 gives a higher hit rate, while the SB is less sensitive to its size)

  26. Advantages • Non-speculative • Data obtained from the SB without intervening stores is always correct • All loads can access data from the SB without any restriction on the type of load or base register • Loads through the SB can bypass address generation and cache access completely • Store-load correlation is established from the instruction encoding bits, simplifying the hardware • The SB uses line-based granularity to capture spatial locality

  27. Questions?

  28. Loads – SB Specific • Early S-L forwarding • A load has an identical signature to an earlier store in the LSQ with no intervening store in between (zero-cycle load & SB hit) • Early SB access • The SB is accessed right after a load is fetched and decoded (zero-cycle load & SB hit) • Delayed SB access • The SB is accessed after memory dependences are resolved because of intervening stores (SB hit) • Non-signature forwarding • Consecutive SB misses to the same SB line get data forwarded from a previous miss (SB miss)
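
The four cases restated as a toy classification function (the predicates and their ordering are hypothetical, not the pipeline's actual decision logic):

    #include <stdbool.h>

    typedef enum {
        EARLY_SL_FORWARD,   /* identical signature to an older store in the LSQ */
        EARLY_SB_ACCESS,    /* SB probed right after fetch/decode               */
        DELAYED_SB_ACCESS,  /* SB probed after memory dependences resolve       */
        NON_SIG_FORWARD,    /* SB miss; data forwarded from a prior miss        */
        ORDINARY_LOAD       /* falls back to address generation and L1          */
    } load_path_t;

    load_path_t classify_load(bool sig_matches_older_store, bool intervening_store,
                              bool sb_hit, bool prior_miss_same_sb_line) {
        if (sig_matches_older_store && !intervening_store) return EARLY_SL_FORWARD;
        if (sb_hit && !intervening_store)                  return EARLY_SB_ACCESS;
        if (sb_hit)                                        return DELAYED_SB_ACCESS;
        if (prior_miss_same_sb_line)                       return NON_SIG_FORWARD;
        return ORDINARY_LOAD;
    }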
