
Computer Organization and Architecture Chapter 7 Large and Fast: Exploiting Memory Hierarchy

Yu-Lun Kuo, Computer Sciences and Information Engineering, University of Tunghai, Taiwan, sscc6991@gmail.com

Presentation Transcript


  1. Computer Organization and Architecture • Chapter 7 Large and Fast: Exploiting Memory Hierarchy • Yu-Lun Kuo • Computer Sciences and Information Engineering, University of Tunghai, Taiwan • sscc6991@gmail.com

  2. Major Components of a Computer • Processor • Control • Datapath • Memory • Devices • Input • Output

  3. Processor-Memory Performance Gap • µProc: 55%/year (2X/1.5 yr), “Moore’s Law” • DRAM: 7%/year (2X/10 yrs) • Processor-memory performance gap grows 50%/year

  4. Introduction • The Principle of Locality • Programs access a relatively small portion of the address space at any instant of time. • Two Different Types of Locality • Temporal Locality (Locality in Time) • If an item is referenced, it will tend to be referenced again soon • e.g., loops, subroutines, the stack, counter variables • Spatial Locality (Locality in Space) • If an item is referenced, items whose addresses are close by tend to be referenced soon • e.g., array accesses, data accessed sequentially

  5. Memory Hierarchy • Memory Hierarchy • A structure that uses multiple levels of memories; as the distance from the CPU increases, the size of the memories and the access time both increase • Locality + smaller HW is faster = memory hierarchy • Levels • each smaller, faster, more expensive/byte than the level below • Inclusive • data found in an upper level is also found in the levels below it

  6. Three Primary Technologies • Building Memory Hierarchies • Main Memory • DRAM (dynamic random access memory) • Caches (closer to the processor) • SRAM (static random access memory) • DRAM vs. SRAM • Speed: DRAM < SRAM (DRAM is slower) • Cost: DRAM < SRAM (DRAM is cheaper per bit)

  7. Introduction • Cache memory • Made of SRAM (Static RAM) • Small amount of fast memory • Sits between normal main memory and CPU • May be located on CPU chip or module

  8. Introduction • Cache memory

  9. A Typical Memory Hierarchy c.2008 • CPU with multiported register file (RF, part of CPU) • Split instruction & data primary caches: L1 Instruction Cache and L1 Data Cache (on-chip SRAM) • Large unified secondary cache: Unified L2 Cache (on-chip SRAM) • Multiple interleaved memory banks (off-chip DRAM)

  10. A Typical Memory Hierarchy • By taking advantage of the principle of locality • Can present the user with as much memory as is available in the cheapest technology • at the speed offered by the fastest technology • Figure, from nearest to farthest from the processor: On-Chip Components (Control, Datapath, RegFile, ITLB, DTLB, Instr Cache, Data Cache, eDRAM) • Second Level Cache (SRAM) • Main Memory (DRAM) • Secondary Memory (Disk) • Speed (# of cycles): ½’s → 1’s → 10’s → 100’s → 1,000’s • Size (bytes): 100’s → K’s → 10K’s → M’s → G’s to T’s • Cost: highest → lowest

  11. Characteristics of Memory Hierarchy • Levels in order of increasing distance from the processor in access time: Processor, L1$, L2$, Main Memory, Secondary Memory • The (relative) size of the memory grows at each level • Transfer units between levels • Processor and L1$: 4-8 bytes (word) • L1$ and L2$: 8-32 bytes (block) • L2$ and Main Memory: 1 to 4 blocks • Main Memory and Secondary Memory: 1,024+ bytes (disk sector = page) • Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in MM, which is a subset of what is in SM

  12. Memory Hierarchy List • Registers • L1 Cache • L2 Cache • L3 cache • Main memory • Disk cache • Disk (RAID) • Optical (DVD) • Tape

  13. Why are separate instruction and data caches (IC and DC) needed?

  14. The Memory Hierarchy: Terminology • Figure: the processor exchanges data with the upper-level memory (holding Blk X), which exchanges blocks with the lower-level memory (holding Blk Y) • Hit: data is in some block in the upper level (Blk X) • Hit Rate: the fraction of memory accesses found in the upper level • Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss

  15. The Memory Hierarchy: Terminology • Figure: the processor exchanges data with the upper-level memory (holding Blk X), which exchanges blocks with the lower-level memory (holding Blk Y) • Miss: data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y) • Miss Rate = 1 - (Hit Rate) • Miss Penalty • Time to replace a block in the upper level + time to deliver the block to the processor • Hit Time << Miss Penalty
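
A minimal sketch of how the terms on slides 14 and 15 combine, assuming the standard textbook relation average access time = hit time + miss rate × miss penalty (the slides define the terms but do not state this formula). The function name and the numbers in the usage line are hypothetical, chosen only for illustration.

```python
def average_access_time(hit_time, hit_rate, miss_penalty):
    """Average memory access time in cycles.

    Assumes the usual relation: hit time + miss rate * miss penalty,
    with miss rate = 1 - hit rate as defined on the slide.
    """
    miss_rate = 1.0 - hit_rate
    return hit_time + miss_rate * miss_penalty


# Hypothetical values: 1-cycle hit, 95% hit rate, 100-cycle miss penalty.
print(average_access_time(hit_time=1, hit_rate=0.95, miss_penalty=100))
```

Because Hit Time << Miss Penalty, even a small miss rate can dominate the average.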

  16. How is the Hierarchy Managed? • registers ↔ memory • by compiler (programmer?) • cache ↔ main memory • by the cache controller hardware • main memory ↔ disks • by the operating system (virtual memory) • virtual to physical address mapping assisted by the hardware (TLB) • by the programmer (files)

  17. 7.2 The basics of Caches • Simple cache • The processor requests are each one word • The block size is one word of data • Two questions to answer (in hardware): • Q1: How do we know if a data item is in the cache? • Q2: If it is, how do we find it?

  18. Caches • Direct Mapped • Assign the cache location based on the address of the word in memory • Address mapping: (block address) modulo (# of blocks in the cache) • First consider block sizes of one word
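
The address mapping on slide 18 can be sketched as a small function. This is an illustrative sketch, not part of the original slides: the function name is hypothetical, and taking the tag to be the remaining upper part of the block address (the integer quotient) is an assumption consistent with the modulo mapping.

```python
def direct_mapped_slot(block_address, num_blocks):
    """Return (index, tag) for a direct-mapped cache.

    index = (block address) modulo (# of blocks in the cache)
    tag   = the rest of the block address (assumed: integer quotient)
    """
    index = block_address % num_blocks
    tag = block_address // num_blocks
    return index, tag


# With a 4-block cache, block addresses 0 and 4 compete for index 0.
print(direct_mapped_slot(0, 4))
print(direct_mapped_slot(4, 4))
print(direct_mapped_slot(15, 4))
```

When the number of blocks is a power of two, the index is simply the low-order bits of the block address and the tag is the high-order bits.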

  19. Direct Mapped (Mapping) Cache

  20. Caches • Tag • Contains the address information required to identify whether a word in the cache corresponds to the requested word • Valid bit • After executing many instructions, some of the cache entries may still be empty • Indicates whether an entry contains a valid address • If valid bit = 0, there cannot be a match for this block

  21. Direct Mapped Cache • Consider the main memory word reference string 0 1 2 3 4 3 4 15 • Start with an empty cache - all blocks initially marked as not valid • Cache contents are shown as (tag, data) for indices 0 to 3 • 0 miss: [00 Mem(0)] • 1 miss: [00 Mem(0), 00 Mem(1)] • 2 miss: [00 Mem(0), 00 Mem(1), 00 Mem(2)] • 3 miss: [00 Mem(0), 00 Mem(1), 00 Mem(2), 00 Mem(3)] • 4 miss: Mem(4) with tag 01 replaces Mem(0): [01 Mem(4), 00 Mem(1), 00 Mem(2), 00 Mem(3)] • 3 hit • 4 hit • 15 miss: Mem(15) with tag 11 replaces Mem(3): [01 Mem(4), 00 Mem(1), 00 Mem(2), 11 Mem(15)] • 8 requests, 6 misses
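
The reference string on slide 21 can be replayed with a short simulation. This is a sketch under stated assumptions: a 4-block direct-mapped cache with one-word blocks (as implied by the two-bit tags on the slide), and a hypothetical function name.

```python
def simulate_direct_mapped(references, num_blocks=4):
    """Replay word references through a direct-mapped cache.

    Each entry has a valid bit and a tag; blocks are one word wide.
    Returns a list of "hit"/"miss" outcomes, one per reference.
    """
    valid = [False] * num_blocks   # all entries start out not valid
    tags = [0] * num_blocks
    outcomes = []
    for address in references:
        index = address % num_blocks
        tag = address // num_blocks
        if valid[index] and tags[index] == tag:
            outcomes.append("hit")
        else:
            outcomes.append("miss")
            valid[index] = True    # fetch the block and record its tag
            tags[index] = tag
    return outcomes


outcomes = simulate_direct_mapped([0, 1, 2, 3, 4, 3, 4, 15])
print(outcomes)
print(outcomes.count("miss"), "misses out of", len(outcomes), "requests")
```

The run reports 6 misses out of 8 requests, matching the slide: only the second references to words 3 and 4 hit.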

  22. Hits vs. Misses • Read hits • this is what we want! • Read misses • stall the CPU, fetch block from memory, deliver to cache, restart • Write hits • can replace data in cache and memory (write-through) • write the data only into the cache (write-back the cache later) • Write misses • read the entire block into the cache, then write the word

  23. What happens on a write? • Writes work somewhat differently • Suppose on a store instruction • Write the data into only the data cache • Memory would have a different value • The cache & memory are “inconsistent” • To keep the main memory & cache consistent • Always write the data into both the memory and the cache • Called write-through (direct write)

  24. What happens on a write? • Although this design handles writes simply • It does not provide very good performance • Every write causes the data to be written to main memory • This takes a long time • Ex. 10% of the instructions are stores • CPI without cache misses: 1.0 • Spending 100 extra cycles on every write • CPI = 1.0 + 100 x 10% = 11, reducing performance by more than a factor of 10
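
The CPI arithmetic on slide 24 can be written as a one-line calculation. The function and parameter names below are hypothetical; the numbers are the ones from the slide.

```python
def cpi_with_write_stalls(base_cpi, store_fraction, write_stall_cycles):
    """Effective CPI when every store stalls for a full memory write."""
    return base_cpi + store_fraction * write_stall_cycles


# Slide values: base CPI 1.0, 10% stores, 100 extra cycles per write.
cpi = cpi_with_write_stalls(base_cpi=1.0, store_fraction=0.10,
                            write_stall_cycles=100)
print(round(cpi, 2))
```

The stall term (10 cycles per instruction on average) dwarfs the base CPI, which is the motivation for the write buffer on the next slide.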

  25. Write Buffer for Write Through • Figure: Processor → Cache, with a Write Buffer between the Cache and DRAM • A Write Buffer is needed between the Cache and Memory (not to be confused with the TLB, the Translation Lookaside Buffer) • A queue that holds data while the data are waiting to be written to memory • Processor: • writes data into the cache and the write buffer • Memory controller: • writes the contents of the buffer to memory

  26. What happens on a write? • Write back (indirect write) • The new value is written only to the block in the cache • The modified block is written to the lower level of the hierarchy when it is replaced

  27. What happens on a write? • Write Through • All writes go to main memory as well as cache • Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date • Lots of traffic • Slows down writes • Write Back • Updates initially made in cache only • Update bit for cache slot is set when update occurs • If block is to be replaced, write to main memory only if update bit is set • Other caches get out of sync
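
A rough sketch contrasting the memory traffic of the two policies on slide 27. It assumes a direct-mapped cache with one-word blocks (so a write miss can allocate the block without fetching it), considers stores only, and uses a per-entry "dirty" flag to play the role of the slide's update bit. The function name and the store sequence are hypothetical.

```python
def count_memory_writes(store_addresses, num_blocks=4, policy="write-back"):
    """Count writes that reach main memory for a sequence of stores.

    write-through: every store is also written to main memory.
    write-back:    memory is written only when a block whose update
                   (dirty) bit is set gets replaced.
    """
    if policy == "write-through":
        return len(store_addresses)

    tags = [None] * num_blocks     # None marks an invalid entry
    dirty = [False] * num_blocks   # the "update bit" from the slide
    memory_writes = 0
    for address in store_addresses:
        index = address % num_blocks
        tag = address // num_blocks
        if tags[index] != tag:
            if dirty[index]:
                memory_writes += 1   # write the old block back first
            tags[index] = tag
        dirty[index] = True          # the store updates only the cache
    return memory_writes


stores = [0, 0, 0, 4, 4, 0]
print(count_memory_writes(stores, policy="write-through"))
print(count_memory_writes(stores, policy="write-back"))
```

For this sequence write-through sends all 6 stores to memory, while write-back performs 2 write-backs and leaves one modified block in the cache still to be written when it is eventually replaced.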

  28. Memory System to Support Caches • It is difficult to reduce the latency to fetch the first word from memory • We can reduce the miss penalty if we increase the bandwidth from the memory to the cache • Figure: three organizations • One-word-wide memory (CPU, cache, bus, memory) • Wide memory (CPU, multiplexor, cache, bus, memory) • Interleaved memory (CPU, cache, bus, memory banks 0 to 3)

  29. One-word-wide memory organization • Figure: CPU, cache, bus, memory • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Miss penalty: 1 + 4 x 15 + 4 x 1 = 65 clock cycles • Number of bytes transferred per bus clock cycle for a single miss • 4 x 4 / 65 = 0.25

  30. Wide memory organization • Figure: CPU, multiplexor, cache, bus, memory • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Two words wide • Miss penalty: 1 + 2 x 15 + 2 x 1 = 33 clock cycles • Bytes per clock cycle: 4 x 4 / 33 = 0.48 • Four words wide • Miss penalty: 1 + 1 x 15 + 1 x 1 = 17 clock cycles • Bytes per clock cycle: 4 x 4 / 17 = 0.94

  31. Interleaved memory organization • Figure: CPU, cache, bus, memory banks 0 to 3 • Assume • A cache block of 4 words • 1 memory bus clock cycle to send the address • 15 clock cycles for each DRAM access initiated • 1 memory bus clock cycle to return a word of data • Each memory bank: 1 word wide • Advantage: the banks are accessed in parallel, so the DRAM latency is paid only once • Miss penalty: 1 + 1 x 15 + 4 x 1 = 20 clock cycles • Bytes per clock cycle: 4 x 4 / 20 = 0.8 • About 3 times the bandwidth of the one-word-wide organization
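
The miss-penalty arithmetic for the three organizations on slides 29 to 31 can be checked with a short calculation. This is a sketch using the slides' assumptions (4-word block, 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer, 4-byte words); the function and parameter names are hypothetical.

```python
def miss_penalty(block_words=4, width_words=1, interleaved=False,
                 address_cycles=1, dram_cycles=15, transfer_cycles=1):
    """Miss penalty in memory bus clock cycles for one cache block."""
    if interleaved:
        # All banks are accessed in parallel, so the DRAM latency is
        # paid once; the words then return one per bus cycle.
        return address_cycles + dram_cycles + block_words * transfer_cycles
    # Each access delivers width_words words (rounded-up division).
    accesses = -(-block_words // width_words)
    return address_cycles + accesses * (dram_cycles + transfer_cycles)


def bytes_per_cycle(penalty, block_words=4, bytes_per_word=4):
    """Bandwidth for a single miss: block bytes over miss penalty."""
    return block_words * bytes_per_word / penalty


organizations = {
    "one-word-wide": miss_penalty(width_words=1),
    "two-word-wide": miss_penalty(width_words=2),
    "four-word-wide": miss_penalty(width_words=4),
    "interleaved (4 banks)": miss_penalty(interleaved=True),
}
for name, penalty in organizations.items():
    print(f"{name}: {penalty} cycles, "
          f"{bytes_per_cycle(penalty):.2f} bytes/cycle")
```

The penalties come out to 65, 33, 17, and 20 cycles. Interleaving gets close to the four-word-wide design without the cost of a wider bus and multiplexor.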

  32. Q & A
