
Lecture 4: Directory Protocols


Presentation Transcript


  1. Lecture 4: Directory Protocols • Topics: directory-based cache coherence implementations

  2. Split Transaction Bus • What would it take to implement the protocol correctly while assuming a split transaction bus? • Split transaction bus: a cache puts out a request, releases the bus (so others can use the bus), and receives its response much later • Assumptions: • only one request per block can be outstanding • separate lines for addr (request) and data (response)

  3. Split Transaction Bus • [Figure: Proc 1, Proc 2, and Proc 3, each with its own cache, attached to shared request lines and response lines]

  4. Design Issues • When does the snoop complete? What if the snoop takes a long time? • What if the buffer in a processor/memory is full? When does the buffer release an entry? Are the buffers identical? • How does each processor ensure that a block does not have multiple outstanding requests? • What determines the write order – requests or responses?

  5. Design Issues II • What happens if a processor is arbitrating for the bus and witnesses another bus transaction for the same address? • If the processor issues a read miss and there is already a matching read in the request table, can we reduce bus traffic?
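
The request-table questions on slides 4 and 5 can be made concrete with a small model. This is a minimal sketch, assuming each cache controller keeps a table keyed by block address; the class `RequestTable`, its methods, and the `capacity` parameter are hypothetical names, not part of the lecture. It illustrates the "only one request per block can be outstanding" assumption and how a second read to the same block might piggyback on the response already in flight rather than issuing a new bus request.

```python
class RequestTable:
    """Hypothetical per-controller table of outstanding bus requests."""

    def __init__(self, capacity=8):
        self.capacity = capacity      # assumed fixed number of entries
        self.entries = {}             # block address -> {"kind", "waiters"}

    def try_issue(self, addr, kind, requester):
        """Return 'issue', 'merge', or 'stall' for a new miss."""
        entry = self.entries.get(addr)
        if entry is not None:
            # A request for this block is already outstanding.
            if kind == "read" and entry["kind"] == "read":
                entry["waiters"].add(requester)   # reuse the pending response
                return "merge"
            return "stall"                        # conflicting request: wait
        if len(self.entries) >= self.capacity:
            return "stall"                        # table full
        self.entries[addr] = {"kind": kind, "waiters": {requester}}
        return "issue"

    def complete(self, addr):
        """Response seen on the data lines: free the entry, return waiters."""
        return self.entries.pop(addr)["waiters"]


table = RequestTable(capacity=2)
print(table.try_issue(0x40, "read", "P1"))    # issue
print(table.try_issue(0x40, "read", "P2"))    # merge
print(table.try_issue(0x40, "write", "P3"))   # stall
print(sorted(table.complete(0x40)))           # ['P1', 'P2']
```

Stalling on a full table is only one possible answer to the "what if the buffer is full" question; a real design could also NACK the request.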

  6. Scalable Multiprocessors • [Figure: nodes 1 to n, each with P, C, Mem, and CA, connected by a scalable interconnection network] • CC NUMA: Cache coherent non-uniform memory access

  7. Directory-Based Protocol • For each block, there is a centralized “directory” that maintains the state of the block in different caches • The directory is co-located with the corresponding memory • Requests and replies on the interconnect are no longer seen by everyone – the directory serializes writes • [Figure: two nodes, each with P, C, Dir, Mem, and CA]

  8. Definitions • Home node: the node that stores memory and directory state for the cache block in question • Dirty node: the node that has a cache copy in modified state • Owner node: the node responsible for supplying data (usually either the home or dirty node) • Also, exclusive node, local node, requesting node, etc. • [Figure: two nodes, each with P, C, Dir, Mem, and CA]

  9. Protocol Steps • [Figure: nodes 1 to n, each with P, C, Dir, Mem, and CA, connected by a scalable interconnection network] • What happens on a read miss and a write miss? • How is information stored in a directory?
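
One possible answer to the two questions on this slide, as a minimal sketch. It assumes a simplified three-state directory (uncached, shared, modified) and a sharer set per block; the class `DirectoryEntry`, the state names, and the message names are hypothetical and chosen only for illustration. Each method returns the messages the home node would send.

```python
class DirectoryEntry:
    """Hypothetical directory state for one memory block at its home node."""

    def __init__(self):
        self.state = "uncached"   # "uncached", "shared", or "modified"
        self.sharers = set()      # ids of nodes holding a copy

    def read_miss(self, requester):
        """Return the list of (message, node) pairs sent on a read miss."""
        msgs = []
        if self.state == "modified":
            owner = next(iter(self.sharers))
            # The dirty node must supply the data and downgrade its copy.
            msgs.append(("fetch", owner))
        msgs.append(("data", requester))
        self.sharers.add(requester)
        self.state = "shared"
        return msgs

    def write_miss(self, requester):
        """Return the list of (message, node) pairs sent on a write miss."""
        others = sorted(self.sharers - {requester})
        if self.state == "modified":
            msgs = [("fetch_invalidate", n) for n in others]
        else:
            msgs = [("invalidate", n) for n in others]
        msgs.append(("data", requester))
        self.sharers = {requester}
        self.state = "modified"
        return msgs


entry = DirectoryEntry()
print(entry.read_miss(1))    # [('data', 1)]
print(entry.read_miss(2))    # [('data', 2)]
print(entry.write_miss(3))   # [('invalidate', 1), ('invalidate', 2), ('data', 3)]
print(entry.read_miss(1))    # [('fetch', 3), ('data', 1)]
```

The sketch ignores acknowledgements, writebacks on replacement, and races between concurrent requests, all of which a real protocol has to handle.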

  10. Directory Organizations • Centralized Directory: one fixed location – bottleneck! • Flat Directories: directory info is in a fixed place, determined by examining the address – can be further categorized as memory-based or cache-based • Hierarchical Directories: the processors are organized as a logical tree structure and each parent keeps track of which of its immediate children has a copy of the block – less storage (?), more searching, can exploit locality

  12. Flat Memory-Based Directories • Directory is associated with memory and stores info for all cache copies • A presence vector stores a bit for every processor, for every memory block – the overhead is a function of memory/block size and #processors • Reducing directory overhead: • Width: pointers (keep track of processor ids of sharers) (need overflow strategy), 2-level protocol to combine info for multiple processors • Height: increase block size, track info only for blocks that are cached (note: cache size << memory size)
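
The overhead claim can be checked with a little arithmetic. This is a minimal sketch, assuming overhead is measured as directory bits per block divided by data bits per block and ignoring any extra state bits; the function names and the example sizes (64-byte blocks, 64 or 256 processors, 4 pointers) are illustrative assumptions, not figures from the lecture.

```python
def full_vector_bits(num_procs):
    """Presence vector: one bit per processor for each memory block."""
    return num_procs


def limited_pointer_bits(num_procs, num_pointers):
    """'Width' reduction: a few sharer pointers of ceil(log2(P)) bits each."""
    pointer_width = (num_procs - 1).bit_length()
    return num_pointers * pointer_width


def overhead(dir_bits_per_block, block_bytes):
    """Directory storage as a fraction of the data storage it tracks."""
    return dir_bits_per_block / (block_bytes * 8)


print(overhead(full_vector_bits(64), 64))           # 0.125
print(overhead(full_vector_bits(256), 64))          # 0.5
print(overhead(limited_pointer_bits(256, 4), 64))   # 0.0625
print(overhead(full_vector_bits(256), 128))         # 0.25
```

The last two lines show the two levers from the slide: narrowing each entry with pointers, and reducing the number of entries per byte of memory by increasing the block size.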

  13. Flat Cache-Based Directories • The directory at the memory home node only stores a pointer to the first cached copy – the caches store pointers to the next and previous sharers (a doubly linked list) • [Figure: Main memory, Cache 7, Cache 3, and Cache 26 linked as a list]

  14. Flat Cache-Based Directories • The directory at the memory home node only stores a pointer to the first cached copy – the caches store pointers to the next and previous sharers (a doubly linked list) • Potentially lower storage, no bottleneck for network traffic • Invalidates are now serialized (takes longer to acquire exclusive access), replacements must update linked list, must handle race conditions while updating list
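
The linked-list organization can be sketched as follows. This is a minimal model under stated assumptions: the class `SharingList` and its methods are hypothetical, new sharers are assumed to be inserted at the head of the list, and the per-cache pointers are kept in two dictionaries for brevity even though in hardware they would live in each cache's own tags.

```python
class SharingList:
    """Hypothetical model of one block's sharer list in a cache-based directory."""

    def __init__(self):
        self.head = None   # pointer kept by the home node's directory
        self.next = {}     # per-cache forward pointer
        self.prev = {}     # per-cache backward pointer

    def add_sharer(self, cache_id):
        """Assumption: a new sharer is inserted at the head of the list."""
        old_head = self.head
        self.next[cache_id] = old_head
        self.prev[cache_id] = None
        if old_head is not None:
            self.prev[old_head] = cache_id
        self.head = cache_id

    def remove_sharer(self, cache_id):
        """A replacement must unlink the cache from its neighbours."""
        nxt = self.next.pop(cache_id)
        prv = self.prev.pop(cache_id)
        if prv is None:
            self.head = nxt
        else:
            self.next[prv] = nxt
        if nxt is not None:
            self.prev[nxt] = prv

    def invalidate_all(self):
        """Walk the list one hop at a time; return the order visited."""
        order = []
        node = self.head
        while node is not None:
            order.append(node)
            node = self.next[node]
        self.head = None
        self.next.clear()
        self.prev.clear()
        return order


block = SharingList()
for cache in (26, 3, 7):
    block.add_sharer(cache)
block.remove_sharer(3)          # Cache 3 replaces the block
print(block.invalidate_all())   # [7, 26]
```

The loop in `invalidate_all` is the serialization the slide warns about: each sharer is reached only through the previous one, so the cost of acquiring exclusive access grows with the length of the list.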

  15. Data Sharing Patterns • Two important metrics that guide our design choices: invalidation frequency and invalidation size – turns out that invalidation size is rarely greater than four • Read-only data: constantly read, never updated (raytrace) • Producer-consumer: flag-based synchronization, updates from neighbors (Ocean) • Migratory: reads and writes from a single processor for a period of time (global sum) • Irregular: unpredictable accesses (distributed task queue)

  16. Protocol Optimizations • C1 attempts to read a block that is in Modified state in C2 • [Figure: three message-flow diagrams among C1, C2, and Mem: Request Response (messages 1 to 5), Intervention Forwarding (messages 1 to 4), and Reply Forwarding (messages 1 to 4)]

  17. Serializing Writes for Coherence • Potential problems: updates may be re-ordered by the network • General solution: do not start the next write until the previous one has completed • Strategies for buffering writes: • buffer at home: requires more storage at home node • buffer at requestors: the request is forwarded to the previous requestor and a linked list is formed • NACK and retry: the home node nacks all requests until the outstanding request has completed
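
The NACK-and-retry strategy is the simplest of the three to sketch. This is a minimal model, assuming the home node tracks which blocks have a request in flight; the class `HomeNode`, its method names, and the "GRANT"/"NACK" strings are hypothetical and used only for illustration.

```python
class HomeNode:
    """Hypothetical home node that serializes writes by NACKing."""

    def __init__(self):
        self.busy = set()   # blocks with a request still in flight

    def request(self, addr):
        """Grant the request, or NACK it if the block is busy."""
        if addr in self.busy:
            return "NACK"          # requester must retry later
        self.busy.add(addr)
        return "GRANT"

    def completed(self, addr):
        """Called once the outstanding request has fully completed."""
        self.busy.discard(addr)


home = HomeNode()
print(home.request(0x100))   # GRANT
print(home.request(0x100))   # NACK
print(home.request(0x200))   # GRANT
home.completed(0x100)
print(home.request(0x100))   # GRANT
```

The home node needs only one busy flag per block here, which is the storage saving relative to buffering at home; the cost is extra network traffic from retries.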
