CENG 3420 Computer Design Spring 2011 Lecture 13: Memory Hierarchy

XU, Qiang (Johnny) 徐強 [Adapted from UC Berkeley’s D. Patterson’s and from PSU’s Mary J. Irwin’s slides, with additional credits to Y. Xie]

Review: Major Components of a Computer • Processor: Control, Datapath • Memory: Cache, Main Memory, Secondary Memory (Disk) • Devices: Input, Output

Processor-Memory Performance Gap • “Moore’s Law”: µProc performance improves 55%/year (2X/1.5 yr) • DRAM performance improves 7%/year (2X/10 yrs) • The processor-memory performance gap grows 50%/year

The “Memory Wall” • Logic vs DRAM speed gap continues to grow • [Figure: clocks per instruction compared with clocks per DRAM access]

Memory Performance Impact on Performance • Suppose a processor executes at • ideal CPI = 1.1 • 50% arith/logic, 30% ld/st, 20% control and that 10% of data memory operations miss with a 50 cycle miss penalty • CPI = ideal CPI + average stalls per instruction = 1.1 (cycles) + (0.30 (data mem ops/instr) x 0.10 (misses/data mem op) x 50 (cycles/miss)) = 1.1 cycles + 1.5 cycles = 2.6 cycles, so 58% of the time (1.5/2.6) the processor is stalled waiting for memory! • A 1% instruction miss rate would add an additional 0.5 to the CPI (1 instruction fetch/instr x 0.01 x 50 cycles)!
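The CPI arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the slide's formula; the function name `effective_cpi` and its parameter names are made up for illustration.

```python
def effective_cpi(ideal_cpi, mem_ops_per_instr, miss_rate, miss_penalty):
    """Return (CPI including memory stalls, fraction of cycles spent stalled)."""
    # Average stall cycles per instruction = accesses/instr x misses/access x cycles/miss
    stall_cycles = mem_ops_per_instr * miss_rate * miss_penalty
    cpi = ideal_cpi + stall_cycles
    return cpi, stall_cycles / cpi


# Numbers from the slide: 30% ld/st, 10% data miss rate, 50 cycle miss penalty
cpi, stalled = effective_cpi(1.1, 0.30, 0.10, 50)
print(f"CPI = {cpi:.1f}, stalled {stalled:.0%} of the time")
# prints: CPI = 2.6, stalled 58% of the time

# Every instruction is fetched once, so a 1% instruction miss rate adds
extra_cpi = 1.0 * 0.01 * 50
print(f"extra CPI from instruction misses = {extra_cpi:.1f}")
# prints: extra CPI from instruction misses = 0.5
```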

The Memory Hierarchy Goal • Fact: Large memories are slow and fast memories are small • How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)? • With hierarchy • With parallelism

Memory Hierarchy Technologies • Random Access Memories (RAMs) • “Random” is good: access time is the same for all locations • DRAM: Dynamic Random Access Memory • High density (1 transistor cells), low power, cheap, slow • Dynamic: needs to be “refreshed” regularly (~ every 4 ms) • SRAM: Static Random Access Memory • Low density (6 transistor cells), high power, expensive, fast • Static: content will last “forever” (until power turned off) • Size: DRAM/SRAM ratio of 4 to 8 • Cost/Cycle time: SRAM/DRAM ratio of 8 to 16 • “Not-so-random” Access Technology • Access time varies from location to location and from time to time (e.g., disk, CDROM)

RAM Memory Uses and Performance Metrics • Caches use SRAM for speed • Main Memory is DRAM for density • Addresses divided into 2 halves (row and column) • RAS or Row Access Strobe triggering row decoder • CAS or Column Access Strobe triggering column selector • Memory performance metrics • Latency: Time to access one word • Access Time: time between request and when word is read or written (read access and write access times can be different) • Cycle Time: time between successive (read or write) requests • Usually cycle time > access time • Bandwidth: How much data can be supplied per unit time • width of the data channel * the rate at which it can be used

A Typical Memory Hierarchy • By taking advantage of the principle of locality • Can present the user with as much memory as is available in the cheapest technology • at the speed offered by the fastest technology • [Figure: On-Chip Components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache), Second Level Cache (SRAM), eDRAM, Main Memory (DRAM), Secondary Memory (Disk)] • Moving away from the processor: • Speed (cycles): ½’s, 1’s, 10’s, 100’s, 1,000’s • Size (bytes): 100’s, K’s, 10K’s, M’s, G’s to T’s • Cost: highest to lowest

Memory Hierarchy Technologies • Caches use SRAM for speed and technology compatibility • Low density (6 transistor cells), high power, expensive, fast • Static: content will last “forever” (until power turned off) • [Figure: 2M x 16 SRAM with a 21-bit Address input, Chip select, Output enable, Write enable, 16-bit Din[15-0] and 16-bit Dout[15-0]] • Main Memory uses DRAM for size (density) • High density (1 transistor cells), low power, cheap, slow • Dynamic: needs to be “refreshed” regularly (~ every 8 ms) • 1% to 2% of the active cycles of the DRAM • Addresses divided into 2 halves (row and column) • RAS or Row Access Strobe triggering row decoder • CAS or Column Access Strobe triggering column selector

Classical RAM Organization (~Square) • [Figure: RAM Cell Array crossed by word (row) lines and bit (data) lines; the row address drives the Row Decoder and the column address drives the Column Selector & I/O Circuits, which output the data bit or word] • Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell • One memory row holds a block of data, so the column address selects the requested bit or word from that block

Classical DRAM Organization (~Square Planes) • [Figure: a stack of RAM Cell Array planes crossed by word (row) lines and bit (data) lines; the row address drives the Row Decoder and the column address drives the Column Selector & I/O Circuits; each plane supplies one data bit of the data word] • Each intersection represents a 1-T DRAM cell • The column address selects the requested bit from the row in each plane

Classical DRAM Operation • DRAM Organization: • N rows x N columns x M-bit (M bit planes) • Read or Write M-bit at a time • Each M-bit access requires a RAS / CAS cycle • [Figure: N rows x N cols DRAM array with Row Address and Column Address inputs and an M-bit Output; timing diagram in which each M-bit access (1st, 2nd) takes one full Cycle Time, with RAS latching the Row Address and CAS latching the Col Address]

Page Mode DRAM Operation • Page Mode DRAM • N x M SRAM to save a row • After a row is read into the SRAM “register” • Only CAS is needed to access other M-bit words on that row • RAS remains asserted while CAS is toggled • [Figure: N rows x N cols DRAM array (M bit planes) feeding an N x M SRAM, with Row Address and Column Address inputs and an M-bit Output; timing diagram with one Row Address under RAS followed by four Col Addresses under CAS for the 1st through 4th M-bit accesses]

Synchronous DRAM (SDRAM) Operation • After a row is read into the SRAM register • Inputs CAS as the starting “burst” address along with a burst length • Transfers a burst of data from a series of sequential addresses within that row • A clock controls transfer of successive words in the burst – 300MHz in 2004 • [Figure: N rows x N cols DRAM array (M bit planes) feeding an N x M SRAM, with Row Address input, a Column Address that is incremented (+1) during the burst, and an M-bit Output; timing diagram with one Row Address under RAS and one Col Address under CAS followed by the 1st through 4th M-bit accesses, then the next Row Address]

Other DRAM Architectures • Double Data Rate SDRAMs – DDR-SDRAMs (and DDR-SRAMs) • Double data rate because they transfer data on both the rising and falling edge of the clock • Are the most widely used form of SDRAMs • DDR2-SDRAMs http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf

DRAM Memory Latency & Bandwidth Milestones • In the time that the memory to processor bandwidth doubles, the memory latency improves by a factor of only 1.2 to 1.4 • To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks (Patterson, CACM Vol 47, #10, 2004)

Memory Systems that Support Caches • The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways • [Figure: one word wide organization (one word wide bus and one word wide memory): on-chip CPU and Cache connected by a bus (32-bit data & 32-bit addr per cycle) to Memory] • Assume • 1 clock cycle to send the address • 25 clock cycles for DRAM cycle time, 8 clock cycles access time • 1 clock cycle to return a word of data • Memory-Bus to Cache bandwidth • number of bytes accessed from memory and transferred to cache/CPU per clock cycle

One Word Wide Memory Organization • If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall the number of cycles required to return one data word from memory • 1 cycle to send address • 25 cycles to read DRAM • 1 cycle to return data • 27 total clock cycles miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/27 = 0.148 bytes per clock

One Word Wide Memory Organization, con’t • What if the block size is four words? • 1 cycle to send 1st address • 4 x 25 = 100 cycles to read DRAM (four back-to-back 25 cycle accesses) • 1 cycle to return last data word • 102 total clock cycles miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/102 = 0.157 bytes per clock

One Word Wide Memory Organization, con’t • What if the block size is four words and if a fast page mode DRAM is used? • 1 cycle to send 1st address • 25 + 3*8 = 49 cycles to read DRAM (one 25 cycle access followed by three 8 cycle accesses) • 1 cycle to return last data word • 51 total clock cycles miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/51 = 0.314 bytes per clock

Interleaved Memory Organization • [Figure: on-chip CPU and Cache connected by a bus to Memory bank 0, Memory bank 1, Memory bank 2 and Memory bank 3] • For a block size of four words • 1 cycle to send 1st address • 25 + 3 = 28 cycles to read DRAM (the four 25 cycle bank accesses overlap) • 1 cycle to return last data word • 30 total clock cycles miss penalty • Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/30 = 0.533 bytes per clock
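The miss penalties and bandwidths for the memory organizations above can be reproduced with a short Python sketch. It uses the timing assumptions stated on the earlier slide (1 cycle to send the address, 25 cycle DRAM cycle time, 8 cycle page mode access, 1 cycle per returned word). The function names and the organization labels are invented for illustration, and the interleaved case assumes at least as many banks as words in the block.

```python
SEND_ADDR = 1       # cycles to send the address
DRAM_CYCLE = 25     # cycles for a full DRAM access (cycle time)
PAGE_ACCESS = 8     # cycles for each later access to the same row (fast page mode)
RETURN_WORD = 1     # cycles to return one word over the bus
BYTES_PER_WORD = 4


def miss_penalty(words_per_block, organization):
    """Miss penalty in clock cycles for one cache block fill."""
    if organization == "one_word_wide":
        dram = words_per_block * DRAM_CYCLE
    elif organization == "fast_page_mode":
        dram = DRAM_CYCLE + (words_per_block - 1) * PAGE_ACCESS
    elif organization == "interleaved":
        # Bank accesses overlap; the earlier words are returned one per cycle
        dram = DRAM_CYCLE + (words_per_block - 1) * RETURN_WORD
    else:
        raise ValueError(f"unknown organization: {organization}")
    return SEND_ADDR + dram + RETURN_WORD


def bandwidth(words_per_block, organization):
    """Bytes transferred per clock cycle for a single miss."""
    return words_per_block * BYTES_PER_WORD / miss_penalty(words_per_block, organization)


for words, org in [(1, "one_word_wide"), (4, "one_word_wide"),
                   (4, "fast_page_mode"), (4, "interleaved")]:
    print(f"{words}-word block, {org}: {miss_penalty(words, org)} cycles, "
          f"{bandwidth(words, org):.3f} bytes/clock")
# 1-word block, one_word_wide: 27 cycles, 0.148 bytes/clock
# 4-word block, one_word_wide: 102 cycles, 0.157 bytes/clock
# 4-word block, fast_page_mode: 51 cycles, 0.314 bytes/clock
# 4-word block, interleaved: 30 cycles, 0.533 bytes/clock
```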

DRAM Memory System Summary • It’s important to match the cache characteristics • caches access one block at a time (usually more than one word) • with the DRAM characteristics • use DRAMs that support fast multiple word accesses, preferably ones that match the block size of the cache • with the memory-bus characteristics • make sure the memory-bus can support the DRAM access rates and patterns • with the goal of increasing the Memory-Bus to Cache bandwidth

The Cache • Two questions to answer (in hardware): • Q1: How do we know if a data item is in the cache? • Q2: If it is, how do we find it? • Direct mapped • For each item of data at the lower level, there is exactly one location in the cache where it might be - so lots of items at the lower level must share locations in the upper level • Address mapping: (block address) modulo (# of blocks in the cache) • First consider block sizes of one word

Caching: A Simple First Example • Main Memory: 16 one-word blocks at addresses 0000xx, 0001xx, ..., 1111xx • Two low order bits define the byte in the word (32b words) • Cache: 4 blocks (Index 00, 01, 10, 11), each with Valid, Tag and Data fields • Q2: How do we find it? Use next 2 low order memory address bits – the index – to determine which cache block (i.e., modulo the number of blocks in the cache) • Q1: Is it there? Compare the cache tag to the high order 2 memory address bits to tell if the memory block is in the cache • Address mapping: (block address) modulo (# of blocks in the cache)

Direct Mapped Cache • Consider the main memory word reference string 0 1 2 3 4 3 4 15 • Start with an empty cache - all blocks initially marked as not valid • 0 miss: cache holds 00 Mem(0) • 1 miss: 00 Mem(0), 00 Mem(1) • 2 miss: 00 Mem(0), 00 Mem(1), 00 Mem(2) • 3 miss: 00 Mem(0), 00 Mem(1), 00 Mem(2), 00 Mem(3) • 4 miss: 01 Mem(4) replaces 00 Mem(0), giving 01 Mem(4), 00 Mem(1), 00 Mem(2), 00 Mem(3) • 3 hit • 4 hit • 15 miss: 11 Mem(15) replaces 00 Mem(3) • 8 requests, 6 misses
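The trace above can be replayed with a small simulator. This is a minimal sketch of a direct mapped cache with one-word blocks, assuming the four-block cache of the previous slide; `simulate_direct_mapped` is a hypothetical helper name, not something from the course material.

```python
def simulate_direct_mapped(refs, num_blocks=4):
    """Replay word references on a direct mapped cache with one-word blocks.

    Returns a list of "hit"/"miss" outcomes, one per reference.
    """
    valid = [False] * num_blocks   # all blocks start out not valid
    tags = [None] * num_blocks
    outcomes = []
    for block_addr in refs:
        index = block_addr % num_blocks    # (block address) modulo (# of blocks)
        tag = block_addr // num_blocks     # remaining high order bits
        if valid[index] and tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            valid[index] = True            # install the new block
            tags[index] = tag
    return outcomes


outcomes = simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15])
print(outcomes)
# ['miss', 'miss', 'miss', 'miss', 'miss', 'hit', 'hit', 'miss']
print(len(outcomes), "requests,", outcomes.count("miss"), "misses")
# 8 requests, 6 misses
```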

MIPS Direct Mapped Cache Example • One word/block, cache size = 1K words • Address fields: bits 31-12 are the 20-bit Tag, bits 11-2 are the 10-bit Index, bits 1-0 are the Byte offset • Cache array: 1024 entries (Index 0 to 1023), each with Valid, a 20-bit Tag and 32-bit Data; outputs are Hit and Data • What kind of locality are we taking advantage of?
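As a rough illustration of the address fields on this slide (20-bit tag, 10-bit index, 2-bit byte offset), the following Python sketch splits a 32-bit address with shifts and masks. The function name `split_address` and the sample address are arbitrary choices for the example.

```python
def split_address(addr, index_bits=10, byte_offset_bits=2):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    byte_offset = addr & ((1 << byte_offset_bits) - 1)          # bits 1-0
    index = (addr >> byte_offset_bits) & ((1 << index_bits) - 1)  # bits 11-2
    tag = addr >> (byte_offset_bits + index_bits)               # bits 31-12
    return tag, index, byte_offset


tag, index, byte_offset = split_address(0x12345678)
print(hex(tag), index, byte_offset)
# 0x12345 414 0
```

On a lookup, the index selects one of the 1024 entries, and the access is a hit only if that entry is valid and its stored tag equals the tag field of the address.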

Handling Cache Hits • Read hits (I$ and D$) • this is what we want! • Write hits (D$ only) • allow cache and memory to be inconsistent • write the data only into the cache (then write-back the cache contents to the memory when that cache block is “evicted”) • need a dirty bit for each cache block to tell if it needs to be written back to memory when it is evicted • require the cache and memory to be consistent • always write the data into both the cache and the memory (write-through) • don’t need a dirty bit • writes run at the speed of the main memory - slow! – or can use a write buffer, so only have to stall if the write buffer is full
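The difference between the two write-hit policies can be sketched by counting the memory writes caused by a single cache block. This is a toy model with made-up names (`CacheBlock`, `write_hit`, `evict`); it ignores the write buffer and models only the dirty bit behavior described above.

```python
class CacheBlock:
    """Toy model of one cache block under a given write-hit policy."""

    def __init__(self, policy):
        if policy not in ("write-through", "write-back"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.dirty = False        # only meaningful for write-back
        self.memory_writes = 0    # writes that reach main memory

    def write_hit(self):
        if self.policy == "write-through":
            self.memory_writes += 1   # cache and memory are both updated
        else:
            self.dirty = True         # update the cache only; memory is now stale

    def evict(self):
        if self.policy == "write-back" and self.dirty:
            self.memory_writes += 1   # write the block back once, on eviction
            self.dirty = False


results = {}
for policy in ("write-through", "write-back"):
    block = CacheBlock(policy)
    for _ in range(10):               # ten write hits to the same block
        block.write_hit()
    block.evict()
    results[policy] = block.memory_writes

print(results)
# {'write-through': 10, 'write-back': 1}
```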

Review: Why Pipeline? For Throughput! • [Figure: pipeline diagram over time (clock cycles) with Inst 0 through Inst 4 in instruction order, each passing through the I$, Reg, ALU, D$ and Reg stages, offset by one cycle] • To avoid a structural hazard need two caches on-chip: one for instructions (I$) and one for data (D$) • To keep the pipeline running at its maximum rate both I$ and D$ need to satisfy a request from the datapath every cycle. • What happens when they can’t do that?

Another Reference String Mapping • Consider the main memory word reference string 0 4 0 4 0 4 0 4 • Start with an empty cache - all blocks initially marked as not valid • 0 miss: cache block 00 loads 00 Mem(0) • 4 miss: 01 Mem(4) replaces 00 Mem(0) • 0 miss: 00 Mem(0) replaces 01 Mem(4) • 4 miss • 0 miss • 4 miss • 0 miss • 4 miss (each access replaces the block loaded by the previous one) • 8 requests, 8 misses • Ping pong effect due to conflict misses - two memory locations that map into the same cache block

Sources of Cache Misses • Compulsory (cold start or process migration, first reference): • First access to a block, “cold” fact of life, not a whole lot you can do about it • If you are going to run “millions” of instructions, compulsory misses are insignificant • Conflict (collision) • Multiple memory locations mapped to the same cache location • Solution 1: increase cache size • Solution 2: increase associativity • Capacity • Cache cannot contain all blocks accessed by the program • Solution: increase cache size

Handling Cache Misses • Read misses (I$ and D$) • stall the pipeline, fetch the block from the next level in the memory hierarchy, write the word+tag in the cache and send the requested word to the processor, let the pipeline resume • Write misses (D$ only), three options: • stall the pipeline, fetch the block from next level in the memory hierarchy, install it in the cache (may involve having to evict a dirty block if using a write-back cache), write the word+tag in the cache, let the pipeline resume • or (normally used in write-back caches) Write allocate – just write the word+tag into the cache (may involve having to evict a dirty block), no need to check for cache hit, no need to stall • or (normally used in write-through caches with a write buffer) No-write allocate – skip the cache write (but must invalidate that cache block since it will now hold stale data) and just write the word to the write buffer (and eventually to the next memory level), no need to stall if the write buffer isn’t full

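Multiword Block Direct Mapped Cache • Four words/block, cache size = 1K words • Address fields: bits 31-12 are the 20-bit Tag, bits 11-4 are the 8-bit Index, bits 3-2 are the Block offset, bits 1-0 are the Byte offset • Cache array: 256 entries (Index 0 to 255), each with Valid, a 20-bit Tag and four 32-bit Data words; the Block offset selects the word; outputs are Hit and Data • What kind of locality are we taking advantage of?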
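The field widths for this cache and for the earlier one-word-per-block cache follow from the cache size and block size. The sketch below derives them, assuming 32-bit addresses, 4-byte words, and power-of-two sizes; `field_widths` is a hypothetical helper name.

```python
def field_widths(cache_words, words_per_block, address_bits=32, bytes_per_word=4):
    """Return (tag, index, block offset, byte offset) widths in bits.

    Assumes cache_words, words_per_block and bytes_per_word are powers of two.
    """
    num_blocks = cache_words // words_per_block
    index_bits = num_blocks.bit_length() - 1             # log2(number of blocks)
    block_offset_bits = words_per_block.bit_length() - 1  # log2(words per block)
    byte_offset_bits = bytes_per_word.bit_length() - 1    # log2(bytes per word)
    tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits
    return tag_bits, index_bits, block_offset_bits, byte_offset_bits


print(field_widths(1024, 1))   # one word/block:   (20, 10, 0, 2)
print(field_widths(1024, 4))   # four words/block: (20, 8, 2, 2)
```

With the cache size fixed, moving from one-word to four-word blocks trades two index bits for two block offset bits, so the tag stays at 20 bits while the number of entries drops from 1024 to 256.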

Taking Advantage of Spatial Locality • Let cache block hold more than one word • Consider the main memory word reference string 0 1 2 3 4 3 4 15 • Start with an empty cache - all blocks initially marked as not valid • 0 miss: load 00 Mem(1) Mem(0) • 1 hit • 2 miss: load 00 Mem(3) Mem(2) • 3 hit • 4 miss: 01 Mem(5) Mem(4) replaces 00 Mem(1) Mem(0) • 3 hit • 4 hit • 15 miss: 11 Mem(15) Mem(14) replaces 00 Mem(3) Mem(2) • 8 requests, 4 misses
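The same reference string can be replayed on a cache with two-word blocks. This sketch assumes the geometry implied by the trace (two blocks of two words each, direct mapped); `simulate_multiword` is an illustrative name only.

```python
def simulate_multiword(refs, num_blocks=2, words_per_block=2):
    """Replay word references on a direct mapped cache with multiword blocks."""
    tags = [None] * num_blocks   # None means the block is not valid
    outcomes = []
    for word_addr in refs:
        block_addr = word_addr // words_per_block   # drop the block offset
        index = block_addr % num_blocks
        tag = block_addr // num_blocks
        if tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            tags[index] = tag    # a miss brings in the whole block
    return outcomes


outcomes = simulate_multiword([0, 1, 2, 3, 4, 3, 4, 15])
print(outcomes)
# ['miss', 'hit', 'miss', 'hit', 'miss', 'hit', 'hit', 'miss']
print(len(outcomes), "requests,", outcomes.count("miss"), "misses")
# 8 requests, 4 misses
```

Setting `num_blocks=4, words_per_block=1` gives the one-word-block cache with the same total capacity, which takes 6 misses on this string.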

Miss Rate vs Block Size vs Cache Size • Miss rate goes up if the block size becomes a significant fraction of the cache size because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses)

Block Size Tradeoff • [Figure: three plots against Block Size: Miss Penalty; Miss Rate (first exploits spatial locality, then fewer blocks compromises temporal locality); Average Access Time (rises with increased miss penalty & miss rate)] • Larger block sizes take advantage of spatial locality but • If the block size is too big relative to the cache size, the miss rate will go up • Larger block size means larger miss penalty • Latency to first word in block + transfer time for remaining words • In general, Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate
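The average memory access time formula is easy to evaluate directly. In the sketch below the function name `amat` is arbitrary, and the hit time, miss rates and miss penalties are hypothetical values chosen only to show the tradeoff; they are not measurements from the slides.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate (in cycles)."""
    return hit_time + miss_penalty * miss_rate


# Hypothetical numbers: a larger block lowers the miss rate but raises the penalty
small_block = amat(hit_time=1, miss_rate=0.10, miss_penalty=20)
large_block = amat(hit_time=1, miss_rate=0.06, miss_penalty=40)
print(f"small blocks: {small_block:.2f} cycles, large blocks: {large_block:.2f} cycles")
```

With these made-up values the larger block has the better miss rate but the worse average access time, which is the situation the tradeoff above warns about.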

Multiword Block Considerations • Read misses (I$ and D$) • Processed the same as for single word blocks – a miss returns the entire block from memory • Miss penalty grows as block size grows • Early restart – datapath resumes execution as soon as the requested word of the block is returned • Requested word first – requested word is transferred from the memory to the cache (and datapath) first • Nonblocking cache – allows the datapath to continue to access the cache while the cache is handling an earlier miss • Write misses (D$) • Can’t use write allocate or will end up with a “garbled” block in the cache (e.g., for 4 word blocks, a new tag, one word of data from the new block, and three words of data from the old block), so must fetch the block from memory first and pay the stall time

Other Ways to Reduce Cache Miss Rates • Allow more flexible block placement • In a direct mapped cache a memory block maps to exactly one cache block • At the other extreme, could allow a memory block to be mapped to any cache block – fully associative cache • A compromise is to divide the cache into sets each of which consists of n “ways” (n-way set associative) • Use multiple levels of caches • Add a second level of caches on chip – normally a unified L2 cache (i.e., it holds both instructions and data) • L1 caches focus on minimizing hit time in support of a shorter clock cycle (smaller with smaller block sizes) • L2 cache focuses on reducing miss rate to reduce the penalty of long main memory access times (larger with larger block sizes)

Cache Summary • The Principle of Locality: • Programs are likely to access a relatively small portion of the address space at any instant of time • Temporal Locality: Locality in Time • Spatial Locality: Locality in Space • Three major categories of cache misses: • Compulsory misses: sad facts of life, e.g., cold start misses • Conflict misses: increase cache size and/or associativity. Nightmare Scenario: ping pong effect! • Capacity misses: increase cache size • Cache design space • total size, block size, associativity (replacement policy) • write-hit policy (write-through, write-back) • write-miss policy (write allocate, write buffers)

Improving Cache Performance • Reduce the miss rate • bigger cache • associative cache • larger blocks (16 to 64 bytes) • use a victim cache – a small buffer that holds the most recently discarded blocks • Reduce the miss penalty • smaller blocks • for large blocks fetch critical word first • use a write buffer • check write buffer (and/or victim cache) on read miss – may get lucky • use multiple cache levels – L2 cache not tied to CPU clock rate • faster backing store/improved memory bandwidth • wider buses • SDRAMs • Reduce the hit time • smaller cache • direct mapped cache • smaller blocks • for writes • no write allocate – just write to write buffer • write allocate – write to a delayed write buffer that then writes to the cache
