  1. CS1104 – Computer Organization PART 2: Computer Architecture Lecture 10 Memory Hierarchy

  2. Topics • Memories: Types • Memory Hierarchy: why? • Basics of caches • Measuring cache performance • Improving cache performance • Framework for memory hierarchies

  3. Memories: Review • SRAM: • value is stored on a pair of inverting gates • very fast but takes up more space than DRAM (4 to 6 transistors per bit) • access time: 5-25 ns • cost (US$) per MByte in 1997: 100 to 250 • DRAM: • value is stored as a charge on a capacitor (must be refreshed) • much smaller than SRAM, but slower (by a factor of 5 to 10) • access time: 60-120 ns • cost (US$) per MByte in 1997: 5 to 10

  4. Memory Hierarchy: why? • Users want large and fast memories! • 1997 figures: SRAM access times are 2-25 ns at a cost of $100 to $250 per MByte. DRAM access times are 60-120 ns at a cost of $5 to $10 per MByte. Disk access times are 10 to 20 million ns at a cost of $0.10 to $0.20 per MByte. • Try and give it to them anyway: build a memory hierarchy. • [Figure: pyramid of levels from the CPU down through Level 1, Level 2, ..., Level n; speed increases toward the CPU, size increases away from it]

  5. Memory Hierarchy: requirements • If a level is closer to the processor, it must… • be smaller • be faster • contain a subset (the most recently used data) of the levels beneath it • Conversely, each level contains all the data held in the levels above it (those closer to the processor) • The lowest level (usually disk or main memory) contains all the available data

  6. Locality • The principle that makes having a memory hierarchy a good idea • If an item is referenced: • temporal locality: it will tend to be referenced again soon • spatial locality: nearby items will tend to be referenced soon • Our initial focus: two levels (upper, lower) • block: minimum unit of data transferred between levels • hit: the requested data is in the upper level • miss: the requested data is not in the upper level

  7. Exploiting locality: Caches

  8. Cache • Two issues: • How do we know if a data item is in the cache? • If it is, how do we find it? • Our first example: • block size is one word of data • "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be • i.e., many items at the lower level share the same location in the upper level

  9. Direct Mapped Cache • [Figure: a 4-entry direct-mapped cache (indices 0-3) in front of a 16-location memory (addresses 0-F); each memory address maps to one cache index] • Cache location 0 can be occupied by data from: • memory locations 0, 4, 8, ... • In general: any memory location whose address is a multiple of 4

  10. Direct Mapped Cache • Mapping: the cache index is the block address modulo the number of blocks in the cache
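
To make the mapping concrete, here is a minimal C sketch of the modulo placement rule for one-word (4-byte) blocks; NUM_BLOCKS and cache_index are illustrative names, not part of any real hardware interface:

#include <stdint.h>

/* Direct-mapped placement: each block of memory maps to exactly one
   cache index. NUM_BLOCKS is an assumed power-of-two cache size. */
#define NUM_BLOCKS 1024

uint32_t cache_index(uint32_t byte_addr)
{
    uint32_t block_addr = byte_addr / 4;   /* one-word (4-byte) blocks */
    return block_addr % NUM_BLOCKS;        /* index = block address mod #blocks */
}

For a power-of-two number of blocks, the modulo is simply the low-order bits of the block address, which is why hardware can extract the index with plain wiring rather than a divider.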

  11. Issues with Direct Mapped Caches • Since multiple memory addresses map to the same cache index, how do we tell which one is in there? • What if we have a block size > 1 byte? • Solution: divide the memory address into three fields (from most to least significant bit: tag | index | byte offset): • tag: to check if we have the correct block • index: to select the block • byte offset: to select the byte within the block

  12. Direct Mapped Caches: Terminology • All fields are read as unsigned integers. • Index: specifies the cache index (which “row” of the cache we should look in) • Offset: once we’ve found correct block, specifies which byte within the block we want • Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location

  13. Direct Mapped Cache: Example • Suppose we have a 16 KB direct-mapped cache with 4-word blocks. • Determine the size of the tag, index and offset fields if we’re using a 32-bit architecture. • Offset • need to specify the correct byte within a block • a block contains 4 words = 16 bytes = 2^4 bytes • need 4 bits to specify the correct byte

  14. Direct Mapped Cache: Example • Index • need to specify the correct row in the cache • the cache contains 16 KB = 2^14 bytes • a block contains 2^4 bytes (4 words) • # rows/cache = # blocks/cache (since there’s one block/row) = (bytes/cache) / (bytes/row) = 2^14 / 2^4 = 2^10 rows/cache • need 10 bits to specify this many rows

  15. Direct Mapped Cache: Example • Tag • use the remaining bits as the tag • tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits • so the tag is the leftmost 18 bits of the memory address
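
The field widths just derived (18-bit tag, 10-bit index, 4-bit offset) can be checked with a small C sketch; the constants and the sample address come from this example, everything else is illustrative:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 4    /* 2^4  = 16 bytes per block */
#define INDEX_BITS  10   /* 2^10 = 1024 rows          */
/* tag = 32 - 10 - 4 = 18 bits */

int main(void)
{
    uint32_t addr   = 0x00000014;  /* one of the sample addresses used on slide 19 */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=%u index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);  /* tag=0 index=1 offset=4 */
    return 0;
}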

  16. Direct Mapped Cache • For MIPS: [Figure: a 1024-entry direct-mapped cache with one-word blocks. Address bits 31-12 form the 20-bit tag, bits 11-2 the 10-bit index, and bits 1-0 the byte offset; each of the 1024 entries (0-1023) holds a valid bit, a 20-bit tag, and 32 bits of data. On an access, the index selects an entry, the stored tag is compared with the address tag, and Hit is asserted when the entry is valid and the tags match]

  17. #Bits required (example) • 32-bit byte addresses • Direct-mapped cache of size 2^n words with one-word (4-byte) blocks • What is the size of the “tag field”? • 32 - (n + 2) bits (2 bits for the byte offset and n bits for the index) • What is the total number of bits in the cache? • 2^n x (block size + tag size + valid field size) = 2^n x (32 + (32 - n - 2) + 1), because the block size is 32 bits = 2^n x (63 - n)
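
As a sanity check on this formula, a hedged C sketch (the function name is invented) that evaluates 2^n x (63 - n):

#include <stdio.h>

/* Total bits in a direct-mapped cache of 2^n one-word blocks:
   2^n x (32 data + (32 - n - 2) tag + 1 valid) = 2^n x (63 - n). */
unsigned long long cache_bits(unsigned n)
{
    unsigned tag_bits = 32 - (n + 2);          /* 2 offset bits, n index bits */
    return (1ULL << n) * (32 + tag_bits + 1);  /* = 2^n * (63 - n) */
}

int main(void)
{
    printf("%llu\n", cache_bits(10));  /* 1024 blocks: 1024 x 53 = 54272 bits */
    return 0;
}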

  18. Direct Mapped Cache • Taking advantage of spatial locality: use multiword blocks • [Figure: a direct-mapped cache with multiword blocks; the address is split into tag, index and block/byte offsets, and a multiplexor selects the requested word within the block]

  19. Accessing data in a direct mapped cache • Example: 16 KB, direct-mapped, 4-word blocks • Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014 • Memory values below (only one cache/memory level of the hierarchy):
  Address (hex)  Value of Word
  00000010  a      00000030  e      00008010  i
  00000014  b      00000034  f      00008014  j
  00000018  c      00000038  g      00008018  k
  0000001C  d      0000003C  h      0000801C  l

  20. Accessing data in a direct mapped cache • 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014 • divided (for convenience) into tag, index and byte-offset fields:
  0x00000014 = 000000000000000000 0000000001 0100
  0x0000001C = 000000000000000000 0000000001 1100
  0x00000034 = 000000000000000000 0000000011 0100
  0x00008014 = 000000000000000010 0000000001 0100
  (18-bit tag | 10-bit index | 4-bit offset)

  21. Hits vs. Misses • Read hits • this is what we want! • Read misses • stall the CPU, fetch the block from memory, deliver it to the cache, restart the load instruction • Write hits: • write the data into both the cache and memory (write-through) • write the data only into the cache, and write it back to memory later (write-back) • Write misses: • read the entire block into the cache, then write the word (allocate on write miss) • do not read the cache line; just write to memory (no allocate on write miss)
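
As one way of putting these policies together, here is a hedged sketch of a direct-mapped cache with one-word blocks using write-back with allocate-on-write-miss (one of the pairings above); the structures, array sizes, and function names are all invented for illustration:

#include <stdint.h>

#define NUM_BLOCKS 1024

struct line { int valid, dirty; uint32_t tag, data; };
static struct line cache[NUM_BLOCKS];
static uint32_t memory[1 << 20];  /* word-addressed stand-in for main memory */

/* Write a dirty victim back to memory before it is replaced. */
static void write_back(struct line *l, uint32_t index)
{
    if (l->valid && l->dirty)
        memory[l->tag * NUM_BLOCKS + index] = l->data;
}

static uint32_t load_word(uint32_t addr)
{
    uint32_t word  = addr >> 2;               /* word address */
    uint32_t index = word % NUM_BLOCKS;
    uint32_t tag   = word / NUM_BLOCKS;
    struct line *l = &cache[index];
    if (!(l->valid && l->tag == tag)) {       /* read miss: evict, then fetch */
        write_back(l, index);
        l->valid = 1; l->dirty = 0; l->tag = tag;
        l->data  = memory[word];
    }
    return l->data;                           /* read hit falls straight through */
}

static void store_word(uint32_t addr, uint32_t value)
{
    uint32_t word  = addr >> 2;
    uint32_t index = word % NUM_BLOCKS;
    uint32_t tag   = word / NUM_BLOCKS;
    struct line *l = &cache[index];
    if (!(l->valid && l->tag == tag)) {       /* write miss: allocate the block */
        write_back(l, index);
        l->valid = 1; l->tag = tag;
        l->data  = memory[word];  /* fetch first; redundant for one-word blocks,
                                     but it shows the general allocate pattern */
    }
    l->data  = value;                         /* write hit: update cache only... */
    l->dirty = 1;                             /* ...and mark the line dirty      */
}

A write-through variant would instead update memory on every store and never need the dirty bit or the write_back step.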

  22. Hardware Issues • Make reading multiple words easier by using banks of memory • It can get a lot more complicated...

  23. Performance • Increasing the block size tends to decrease the miss rate, up to a point: with very large blocks the cache holds fewer of them, so blocks are evicted before most of their words are used and the miss rate rises again.

  24. Performance • Use split caches (separate instruction and data caches), because there is more spatial locality in code than in data.

  25. Memory access times • Assume: • 1 clock cycle to send the address • 15 clock cycles to initiate each DRAM access • 1 clock cycle to transfer a word of data • Clock cycles required to access 4 words (one block): • one-word-wide memory: 1 + 4x15 + 4x1 = 65 • four-word-wide memory: 1 + 1x15 + 1 = 17 • interleaved memory (four banks): 1 + 1x15 + 4x1 = 20
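
A tiny C check of this arithmetic; the organization labels repeat the (assumed) designs named above:

#include <stdio.h>

/* Miss penalty for fetching a 4-word block: 1 cycle to send the address,
   15 cycles per DRAM access, 1 cycle per word transferred on the bus. */
int main(void)
{
    int one_word_wide  = 1 + 4 * 15 + 4 * 1;  /* 4 sequential accesses: 65 cycles  */
    int four_word_wide = 1 + 1 * 15 + 1;      /* 1 wide access + transfer: 17      */
    int interleaved    = 1 + 1 * 15 + 4 * 1;  /* 4 banks overlap accesses: 20      */
    printf("%d %d %d\n", one_word_wide, four_word_wide, interleaved);
    return 0;
}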

  26. Improving performance • Two ways of improving performance: • decreasing the miss ratio: associativity • decreasing the miss penalty: multilevel caches

  27. Decreasing miss ratio with associativity • [Figure: an eight-block cache organized with increasing associativity: 1 block/set (direct mapped), 2 blocks/set, 4 blocks/set, and 8 blocks/set (fully associative)]

  28. 4-way set-associative cache

  29. Tag size versus associativity • Cache of 4K blocks, four-word block size (i.e., four-word cache lines), and 32-bit addresses • Direct mapped • byte offset = 4 bits (each block = 4 words = 16 bytes) • index + tag = 32 - 4 = 28 bits • for 4K blocks, 12 index bits are required • #tag bits for each block = 28 - 12 = 16 • total #tag bits = 16 x 4K = 64 Kbits • 4-way set-associative • #sets = 1K, therefore 10 index bits are required • #tag bits for each block = 28 - 10 = 18 • total #tag bits = 18 x 4 x 1K = 72 Kbits
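
The two totals can be reproduced with a short C sketch that derives the index width from the number of sets; the function is illustrative only:

#include <stdio.h>

/* Total tag storage for a cache with 32-bit addresses and 16-byte blocks. */
long total_tag_bits(long num_blocks, int assoc)
{
    int offset_bits = 4;                      /* 16-byte blocks */
    long num_sets   = num_blocks / assoc;
    int index_bits  = 0;
    while ((1L << index_bits) < num_sets)     /* index width = log2(#sets) */
        index_bits++;
    int tag_bits = 32 - offset_bits - index_bits;
    return tag_bits * num_blocks;             /* every block stores its own tag */
}

int main(void)
{
    printf("%ld\n", total_tag_bits(4096, 1)); /* direct mapped: 16 x 4K = 65536 (64 Kbits) */
    printf("%ld\n", total_tag_bits(4096, 4)); /* 4-way: 18 x 4K = 73728 (72 Kbits)         */
    return 0;
}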

  30. Block replacement policy • In a direct mapped cache, when a miss occurs, the requested block can go in only one position. • In a set-associative cache, there are multiple positions in a set where a block can be stored. If all the positions are filled, which block should be replaced? • Least Recently Used (LRU) policy: replace the block that has gone unused the longest (a sketch follows below) • Random: choose a block at random and replace it
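
A hedged sketch of one common way to implement LRU within a set, using per-way age counters (all names invented; real hardware often approximates LRU instead):

/* LRU bookkeeping for one set of a 4-way set-associative cache. */
#define WAYS 4

struct way { int valid; unsigned tag; unsigned age; };

/* On a hit, the touched way becomes the youngest (age 0); only ways that
   were younger than it need to age, keeping ages a permutation of 0..3. */
static void touch(struct way set[WAYS], int hit_way)
{
    for (int w = 0; w < WAYS; w++)
        if (w != hit_way && set[w].age < set[hit_way].age)
            set[w].age++;
    set[hit_way].age = 0;
}

/* On a miss, prefer an empty way; otherwise evict the oldest (LRU) block. */
static int lru_victim(const struct way set[WAYS])
{
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid)
            return w;
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].age > set[victim].age)
            victim = w;
    return victim;
}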

  31. Common framework for memory hierarchies • Q1: where can a block be placed? • note: a block is in this case a cache line • direct mapped cache: one position • n-way set-associative cache: n positions within one set (typically n = 2-8) • fully associative: anywhere • Q2: how is a block found? • direct mapped: the index part of the address selects the single candidate entry • n-way: the index selects a set; all n blocks in the set are searched • fully associative: check all tags

  32. Common framework for memory hierarchies • Q3: which block should be replaced on a miss? • direct mapped: no choice • associative caches: use a replacement algorithm, such as LRU • Q4: what happens on a write? • write-through vs. write-back • on a write miss: allocate vs. no-allocate

  33. Common framework for memory hierarchies • Understanding (cache) misses: the three Cs • compulsory miss: the first access to a block, which cannot yet be in the cache • capacity miss: the cache cannot hold all the blocks the program needs • conflict miss: too many blocks compete for the same set (disappears with full associativity) • [Figure: miss rate versus cache size for direct mapped (1-way), 2-way, 4-way and fully associative caches, decomposed into compulsory and capacity components]

  34. Reading • 3rd edition of the textbook • Chapter 7, Sections 7.1 – 7.3 and Section 7.5 • 2nd edition of the textbook • Chapter 7, Sections 7.1 – 7.3 and Section 7.5
