Cache Organization
Rehashing our terms
• The architectural view of memory is:
  • What the machine language sees
  • Memory is just a big array of storage
• Breaking up the memory system into different pieces – cache, main memory (made up of DRAM), and disk storage – is not architectural.
  • The machine language doesn’t know about it
  • The processor may not know about it
  • A new implementation may not break it up into the same pieces (or break it up at all)
• Caching needs to be transparent!
What’s in a Cache?
• Cache memory can copy data from any part of main memory
• What does it have to store?
  • The data
  • Where it came from (the address)
• Terminology:
  • The TAG holds the memory address
  • The BLOCK holds the memory data
Cache organization
• A cache memory consists of multiple tag/block pairs
• Searches can be done in parallel (within reason)
• At most one tag will match
  • If there is a tag match, it is a cache HIT
  • If there is no tag match, it is a cache MISS
• Our goal is to keep the data we think will be accessed in the near future in the cache
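To make the tag/block pairing and the parallel tag search concrete, here is a minimal C sketch of a fully associative lookup. It assumes, like the example that follows, a 1-byte block with the entire address stored as the tag; the names (cache_line_t, cache_lookup) are illustrative, not from any real implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 2   /* tiny cache, matching the example below */

/* One tag/block pair, plus a valid bit. */
typedef struct {
    bool     valid;  /* has this entry been filled yet? */
    uint32_t tag;    /* the memory address the data came from */
    uint8_t  block;  /* the cached data (1-byte block here) */
} cache_line_t;

/* Hardware checks every tag in parallel; we do it sequentially.
   Returns true on a HIT and writes the data; false on a MISS. */
bool cache_lookup(const cache_line_t cache[], uint32_t addr, uint8_t *data) {
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (cache[i].valid && cache[i].tag == addr) {  /* at most one match */
            *data = cache[i].block;
            return true;   /* cache HIT */
        }
    }
    return false;          /* cache MISS */
}
```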
Cache Terminology
• If a block is found in the cache, it is a “hit”; otherwise, a “miss”
• Hit rate = (# hits) / (# requests made to the cache)
• Miss rate = 1 – hit rate
• Hit time = time to access the cache to see if a block is present + time to get the block to the CPU
• Miss time (aka miss penalty) = time to replace a block in the cache with one from DRAM
• Average cache access time = hit time + miss rate × miss penalty
Cache Misses
• If a block is found in the cache, it is a “hit”; otherwise, a “miss”
• Every cache miss will get the data from memory and ALLOCATE a cache block to put the data in
A very simple memory system
• Setup: 2 cache blocks, a 4-bit tag field, a 1-byte block size. Memory holds the values 100, 110, 120, …, 250 at addresses 0–15, and the processor issues: Ld R1 M[ 1 ], Ld R2 M[ 5 ], Ld R3 M[ 1 ], Ld R3 M[ 7 ], Ld R2 M[ 7 ]
• Ld R1 M[ 1 ]: Is it in the cache? No valid tags, so this is a cache miss. Allocate: address 1 becomes the tag, Mem[1] = 110 becomes the block, mark the line valid. R1 = 110. (Misses: 1, Hits: 0)
• Ld R2 M[ 5 ]: Check tags: 5 ≠ 1 — a cache miss. Allocate the other line: tag = 5, block = Mem[5] = 150. R2 = 150. (Misses: 2, Hits: 0)
• Ld R3 M[ 1 ]: Check tags: 1 = 1 (HIT!). R3 = 110, and the tag-5 line is now the least recently used. (Misses: 2, Hits: 1)
• Ld R3 M[ 7 ]: Neither valid tag matches, and the cache is full. Where does it go???
Cache Misses
• Every cache miss will get the data from memory and ALLOCATE a cache block to put the data in
• Which spot should be allocated? What if the cache is full?
  • Kick someone else out? Which one?
  • Random? – OK, but hard to grade test questions
  • Better than random? How?
Picking the most likely addresses
• What is the probability of accessing a given memory location?
  • With no information, it is just as likely as any other address
• Q: Are programs random?
• A: No! They tend to use the same memory locations over and over
• We can use this to pick the most referenced locations to put into the cache
Locality of Reference
• A program does not access all of its data & code with equal probability (not even close)
• Principle of locality of reference: programs access a relatively small portion of their address space during any given window of time – this applies to both instructions and data
• Temporal locality: if an item was recently used, it will probably be used again soon
• Spatial locality: if an item was recently referenced, nearby items will probably also be referenced soon
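A short illustrative C function makes both kinds of locality visible; nothing here is specific to any machine, it is just the classic array-sum loop.

```c
/* Summing an array exhibits both kinds of locality:
   - temporal: `sum` and `i` (and the loop's instructions)
     are reused on every iteration
   - spatial: a[0], a[1], a[2], ... are adjacent in memory */
int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```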
Using locality in the cache
• How does this affect our cache design?
• Temporal locality says any new miss data should have priority to be placed into the cache
  • It is the most recently referenced location
• Temporal locality also says that the least recently used (LRU) cache line should be evicted to make room for the new line
  • Because the re-access probability falls the longer a cache line goes unreferenced, the LRU line is the least likely to be re-referenced
  • If we haven’t used it in a while, throw it away
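Putting the pieces together, here is a C sketch of the whole policy: a 2-block fully associative cache with LRU replacement, run on the example’s load sequence. The structure and names are illustrative, and the backing “memory” simply reproduces the example’s values (M[i] = 100 + 10·i). Running it reports 3 misses and 2 hits, matching the final counts in the walkthrough that follows.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 2

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  block;
    unsigned last_used;   /* timestamp driving the LRU policy */
} cache_line_t;

static cache_line_t cache[NUM_BLOCKS];
static unsigned now, hits, misses;

/* Backing memory from the example: M[i] = 100 + 10*i. */
static uint8_t mem_read(uint32_t addr) { return (uint8_t)(100 + 10 * addr); }

uint8_t load(uint32_t addr) {
    /* 1. Check all tags. */
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (cache[i].valid && cache[i].tag == addr) {
            hits++;
            cache[i].last_used = ++now;   /* refresh for LRU */
            return cache[i].block;
        }
    }
    /* 2. Miss: allocate an invalid line if one exists,
          otherwise evict the least recently used line. */
    misses++;
    int victim = 0;
    for (int i = 1; i < NUM_BLOCKS; i++)
        if (!cache[i].valid ||
            (cache[victim].valid && cache[i].last_used < cache[victim].last_used))
            victim = i;
    cache[victim] = (cache_line_t){ true, addr, mem_read(addr), ++now };
    return cache[victim].block;
}

int main(void) {
    uint32_t refs[] = { 1, 5, 1, 7, 7 };   /* the example's load sequence */
    for (int i = 0; i < 5; i++) load(refs[i]);
    printf("Misses: %u Hits: %u\n", misses, hits);   /* Misses: 3 Hits: 2 */
    return 0;
}
```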
A very simple memory system (continued)
• Ld R3 M[ 7 ]: Check tags: 7 ≠ 5 and 7 ≠ 1 (MISS!). Evict the LRU line (tag 5) and allocate: tag = 7, block = Mem[7] = 170. R3 = 170. (Misses: 3, Hits: 1)
• Ld R2 M[ 7 ]: Check tags: 7 ≠ 1 and 7 = 7 (HIT!). R2 = 170. (Misses: 3, Hits: 2)
Calculating Average Access Latency
• Avg latency = cache latency + memory latency × miss rate
• Avg latency = 1 cycle + 15 cycles × (3/5) = 10 cycles per reference
• To improve average latency:
  • Improve memory access latency, or
  • Improve cache access latency, or
  • Improve cache hit rate
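The same arithmetic in a few lines of C, using the example’s numbers (these cycle counts are the slide’s assumptions, not properties of any particular machine):

```c
#include <stdio.h>

int main(void) {
    /* numbers from the running example: 3 misses in 5 references */
    double hit_time     = 1.0;        /* cycles to access the cache */
    double miss_penalty = 15.0;       /* cycles to fill from DRAM */
    double miss_rate    = 3.0 / 5.0;

    double avg = hit_time + miss_rate * miss_penalty;
    printf("Average access latency = %.1f cycles\n", avg);   /* 10.0 */
    return 0;
}
```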
Calculating Cost
• How much does a cache cost?
• Calculate storage requirements: 2 bytes of SRAM
• Calculate the overhead to support access (tags): 2 4-bit tags = 1 byte of SRAM
• The cost of the tags is often forgotten for caches, but this cost drives the design of real caches
• What is the cost if a 32-bit address is used?
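Working out that last question under this tiny cache’s assumptions (fully associative, 1-byte blocks, the full address stored as the tag): each of the 2 lines would need a 32-bit tag, so 2 × 32 bits = 8 bytes of tag SRAM to cover just 2 bytes of data – a 4:1 overhead.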
How can we reduce the overhead?
• Have a small address?
  • Impractical, and caches are supposed to be non-architectural
• Cache bigger units than bytes
  • Each block has a single tag, and blocks can be whatever size we choose
Spatial Locality
• Spatial locality in a program says that if we reference a memory location (e.g., 1000), we are more likely to reference a location near it (e.g., 1001) than some random location
• Think: arrays, or variables on the stack
• Approach: when we access an address, also cache the data around it
Overhead?
• This helps reduce the tag (address) overhead on two fronts:
  • We only need to store one tag for each block, which increases the data : tag ratio
  • If we align our blocks, we can reduce how much of the address we need to store
Tag size vs. Block size
• If our address space is 32-bit, an aligned block of 2^k bytes needs only 32 – k tag bits: the low k offset bits are implied by the byte’s position within the block (see the sketch below)
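A small C sketch of this trade-off, assuming a byte-addressable 32-bit address space, aligned power-of-two blocks, and a fully associative cache (no index bits), as in the examples so far:

```c
#include <stdio.h>

/* For a byte-addressable 32-bit address space with aligned,
   power-of-two blocks of B bytes:
     offset bits = log2(B)
     tag bits    = 32 - log2(B) */
int main(void) {
    for (int offset_bits = 0; offset_bits <= 6; offset_bits++) {
        int block_bytes = 1 << offset_bits;
        int tag_bits    = 32 - offset_bits;
        printf("block = %2d bytes  tag = %2d bits  tag/data = %5.2f bits/byte\n",
               block_bytes, tag_bits, (double)tag_bits / block_bytes);
    }
    return 0;
}
```

Doubling the block size removes one tag bit and, more importantly, halves the number of tags needed per byte of data.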
Block size for caches
• Setup: 2 cache blocks, a 2-byte block size, a 3-bit tag field. The tag is now the block number (the upper 3 bits of the address); the low bit is the block offset. Memory again holds 100, 110, …, 250 at addresses 0–15, and the processor issues: Ld R1 M[ 1 ], Ld R2 M[ 5 ], Ld R3 M[ 1 ], Ld R3 M[ 4 ], Ld R2 M[ 0 ]
• Ld R1 M[ 1 ]: Addr 0001 → tag 0, offset 1 — a miss. Allocate: tag = 0, block = {Mem[0], Mem[1]} = {100, 110}. R1 = 110. (Misses: 1, Hits: 0)
• Ld R2 M[ 5 ]: Addr 0101 → tag 2, offset 1 — a miss. Allocate: tag = 2, block = {Mem[4], Mem[5]} = {140, 150}. R2 = 150. (Misses: 2, Hits: 0)
• Ld R3 M[ 1 ]: Addr 0001 → tag 0 matches (HIT!), offset 1 selects 110. R3 = 110. (Misses: 2, Hits: 1)
• Ld R3 M[ 4 ]: Addr 0100 → tag 2 matches (HIT!), offset 0 selects 140. R3 = 140. (Misses: 2, Hits: 2)
• Ld R2 M[ 0 ]: Addr 0000 → tag 0 matches (HIT!), offset 0 selects 100. R2 = 100. (Misses: 2, Hits: 3)
• The 2-byte blocks turn the references to M[4] and M[0] – neighbors of earlier misses – into hits: spatial locality at work.
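Splitting an address this way is just shifts and masks; this sketch uses the example’s 4-bit addresses and 2-byte blocks:

```c
#include <stdio.h>

int main(void) {
    /* Ld R2 M[ 5 ] from the example: address 0101 */
    unsigned addr   = 0x5;
    unsigned tag    = addr >> 1;    /* upper 3 bits: block number 2 */
    unsigned offset = addr & 0x1;   /* low bit: byte 1 within the block */
    printf("addr %u -> tag %u, offset %u\n", addr, tag, offset);
    return 0;
}
```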