
CMPUT229 - Fall 2003

CMPUT229 - Fall 2003. Topic D: The Memory Hierarchy. José Nelson Amaral. Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H). Reading Assignment. Chapter 6: The Memory Hierarchy. Types of Memories.


Presentation Transcript


  1. CMPUT229 - Fall 2003 Topic D: The Memory Hierarchy José Nelson Amaral CMPUT 229 - Computer Organization and Architecture I

  2. Reading Assignment. Chapter 6: The Memory Hierarchy. Bryant, Randal E., O’Hallaron, David, Computer Systems: A Programmer’s Perspective, Prentice Hall, 2003. (B&H)

  3. Types of Memories
      Read/Write Memory (RWM): we can store and retrieve data.
      Random Access Memory (RAM): the time required to read or write a bit of memory is independent of the bit’s location.
      Static Random Access Memory (SRAM): once a word is written to a location, it remains stored as long as power is applied to the chip, unless the location is written again.
      Dynamic Random Access Memory (DRAM): the data stored at each location must be refreshed periodically by reading it and then writing it back again, or else it disappears.

  4. [Figure: SRAM array of IN/OUT/SEL/WR cells, eight rows (0 to 7) by four columns. A 3-to-8 decoder driven by address inputs A2, A1, A0 selects one row. Data inputs DIN3 to DIN0, data outputs DOUT3 to DOUT0; control signals WR_L, WE_L, CS_L, IOE_L, OE_L.]

  5-7. [Figure: the SRAM array diagram of slide 4, repeated.]

  8. Refreshing the Memory. The solution is to periodically refresh the memory cells by reading and writing back each one of them. [Figure: plot of Vcap versus time, with levels VCC, HIGH, LOW, and 0V and events labeled "1 written", "refreshes", and "0 stored".]

  9. SRAM with Bi-directional Data Bus. [Figure: a microprocessor connected to a row of four IN/OUT/SEL/WR cells through data lines DIO3 to DIO0; control signals WR_L, WE_L, CS_L, IOE_L, OE_L.]

  10. DRAM High Level View. [Figure: a DRAM chip organized as 4 rows by 4 columns of supercells with an internal row buffer. The memory controller (connected to the CPU) drives 2 addr lines and 8 data lines.] Bryant/O’Hallaron, p. 459

  11. DRAM RAS Request (RAS = Row Address Strobe). [Figure: the memory controller sends RAS = 2 on the addr lines; row 2 is copied into the internal row buffer.] Bryant/O’Hallaron, p. 460

  12. DRAM CAS Request (CAS = Column Address Strobe). [Figure: the memory controller sends CAS = 1 on the addr lines; supercell (2,1) is read from the internal row buffer and sent over the data lines.] Bryant/O’Hallaron, p. 460
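The two-step RAS/CAS access in slides 10 to 12 can be sketched in C. This is a toy model, not a description of real hardware: the `Dram` type and the `dram_ras`, `dram_cas`, and `dram_read` functions are hypothetical names introduced here, and only the 4 x 4 array of 8-bit supercells and the internal row buffer come from the slides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROWS 4
#define COLS 4

/* Hypothetical model of the DRAM chip in the slides:
   4 rows x 4 columns of 8-bit supercells plus an internal row buffer. */
typedef struct {
    uint8_t cells[ROWS][COLS];
    uint8_t row_buffer[COLS];
} Dram;

/* RAS request: copy the entire selected row into the row buffer. */
void dram_ras(Dram *d, unsigned row)
{
    assert(row < ROWS);
    memcpy(d->row_buffer, d->cells[row], COLS);
}

/* CAS request: return one supercell from the row buffer. */
uint8_t dram_cas(const Dram *d, unsigned col)
{
    assert(col < COLS);
    return d->row_buffer[col];
}

/* Read supercell (row, col): the same 2-bit address lines carry
   the row address first and the column address second. */
uint8_t dram_read(Dram *d, unsigned row, unsigned col)
{
    dram_ras(d, row);
    return dram_cas(d, col);
}
```

Calling `dram_read(&d, 2, 1)` mirrors the example in the slides: RAS = 2 loads row 2 into the buffer, then CAS = 1 picks supercell (2,1) out of it.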

  13. Memory Modules. [Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs (DRAM 0 to DRAM 7). For addr (row = i, col = j), each DRAM supplies the 8 bits of its supercell (i,j): bits 0-7, 8-15, 16-23, 24-31, 32-39, 40-47, 48-55, and 56-63. The memory controller assembles them into the 64-bit doubleword at main memory address A and sends it to the CPU chip.] Bryant/O’Hallaron, p. 461
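As a rough illustration of how the memory controller might combine the eight supercells into one doubleword, here is a minimal sketch. The function name `assemble_doubleword` is hypothetical, and the mapping of DRAM k to bits 8k through 8k+7 is an assumption about the figure, not something the flattened slide text states outright.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_DRAMS 8

/* Combine the bytes read from supercell (i, j) of each of the eight
   chips into a 64-bit doubleword. Assumption: chip k supplies
   bits 8k .. 8k+7. */
uint64_t assemble_doubleword(const uint8_t supercell[NUM_DRAMS])
{
    uint64_t word = 0;

    assert(supercell != NULL);
    for (int k = 0; k < NUM_DRAMS; k++)
        word |= (uint64_t)supercell[k] << (8 * k);
    return word;
}
```

The capacity on the slide follows the same structure: eight chips, each with 8M supercells of one byte, give 8 x 8M x 1 byte = 64 MB.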

  14. Read Cycle on an Asynchronous DRAM
      Step 1: Apply the row address.
      Step 2: RAS goes from high to low and remains low.
      Step 3: Apply the column address.
      Step 4: WE must be high.
      Step 5: CAS goes from high to low and remains low.
      Step 6: OE goes low.
      Step 7: Data appears.
      Step 8: RAS and CAS return to high.
      [Figure: timing diagram with steps 1 to 8 marked.]

  15. Improved DRAMs. Central Idea: Each read to a DRAM actually reads a complete row of bits, or word line, from the DRAM core into an array of sense amps. A traditional asynchronous DRAM interface then selects a small number of these bits to be delivered to the cache/microprocessor. All the other bits already extracted from the DRAM cells into the sense amps are wasted.

  16. Fast Page Mode DRAMs. In a DRAM with Fast Page Mode, a page is defined as all memory addresses that have the same row address. To read in fast page mode, steps 1 to 7 of a standard read cycle are performed. Then OE and CAS are switched high, but RAS remains low. Steps 3 to 7 (providing a new column address, asserting CAS and OE) are then repeated for each new memory location to be read.
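One way to see the benefit of fast page mode is to count how often RAS has to be asserted for a sequence of reads. The sketch below is a simplified model under stated assumptions: the `DramAddr` type and both function names are hypothetical, and the model ignores timing and refresh and counts only row activations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical (row, column) address of one read. */
typedef struct {
    unsigned row;
    unsigned col;
} DramAddr;

/* Standard read cycle: every access applies a row address and
   asserts RAS again. */
size_t ras_count_standard(const DramAddr *a, size_t n)
{
    assert(a != NULL || n == 0);
    return n;
}

/* Fast page mode: RAS stays low while consecutive accesses fall in
   the same page (same row), so only a row change costs a new RAS. */
size_t ras_count_fast_page(const DramAddr *a, size_t n)
{
    size_t count = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (i == 0 || a[i].row != a[i - 1].row)
            count++;
    return count;
}
```

For the sequence (2,0), (2,1), (2,3), (1,0), (1,2), the standard cycle asserts RAS five times while fast page mode asserts it twice, once per page.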

  17. A Fast Page Mode Read Cycle on an Asynchronous DRAM

  18. Enhanced Data Output RAMs (EDO-RAM). The process to read multiple locations in an EDO-RAM is very similar to Fast Page Mode. The difference is that the output drivers are not disabled when CAS goes high. This distinction allows the data from the current read cycle to be present at the outputs while the next cycle begins. As a result, faster read cycle times are possible.

  19. An Enhanced Data Output Read Cycle on an Asynchronous DRAM

  20. Synchronous DRAMs (SDRAM). A Synchronous DRAM (SDRAM) has a clock input. It operates in a fashion similar to fast page mode and EDO DRAM. However, consecutive data is output synchronously on the falling/rising edge of the clock, instead of on command by CAS. The number of data elements output (the length of the burst) is programmable up to the maximum size of the row. The clock period in an SDRAM is typically one order of magnitude shorter than the access time for an individual access.

  21. DDR SDRAM. A Double Data Rate (DDR) SDRAM is an SDRAM that allows data transfers on both the rising and the falling edge of the clock. Thus the effective data transfer rate of a DDR SDRAM is two times the data transfer rate of a standard SDRAM with the same clock frequency.
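The factor of two can be written as a small calculation. This is only a sketch: the function name `peak_rate` is hypothetical, and the 100 MHz clock and 8-byte bus used in the note below are illustrative values that do not come from the slides.

```c
#include <assert.h>

/* Peak transfer rate in bytes per second.
   transfers_per_clock is 1 for a standard SDRAM (one clock edge)
   and 2 for a DDR SDRAM (both clock edges). */
double peak_rate(double clock_hz, unsigned bus_bytes,
                 unsigned transfers_per_clock)
{
    assert(transfers_per_clock == 1 || transfers_per_clock == 2);
    return clock_hz * bus_bytes * transfers_per_clock;
}
```

With the illustrative values, `peak_rate(100e6, 8, 1)` gives 800e6 bytes per second and `peak_rate(100e6, 8, 2)` gives twice that at the same clock frequency.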

  22. The Rambus DRAM (RDRAM). Multiple memory arrays (banks). Rambus DRAMs are synchronous and transfer data on both edges of the clock.

  23. SDRAM Memory Systems. Complex circuits for RAS/CAS/OE. Each DIMM is connected in parallel with the memory controller (DIMM = Dual In-line Memory Module). Often requires buffering. Needs the whole clock cycle to establish valid data. Making the bus wider is mechanically complicated.

  24. RDRAM Memory Systems

  25. Bus Structure. [Figure: the CPU (register file, ALU, bus interface) connects through the system bus to the I/O bridge. The memory bus connects the I/O bridge to main memory. The I/O bus connects the I/O bridge to the USB controller (mouse, keyboard), the graphics adapter (monitor), the disk controller (disk), and expansion slots for other devices such as network adapters.] Bryant/O’Hallaron, p. 472

  26. DMA Request (DMA = Direct Memory Access). [Figure: the bus structure of slide 25.] Bryant/O’Hallaron, p. 473

  27. DMA Transfer (DMA = Direct Memory Access). [Figure: the bus structure of slide 25.] Bryant/O’Hallaron, p. 473

  28. Interrupt: DMA Completion Notification (DMA = Direct Memory Access). [Figure: the bus structure of slide 25.] Bryant/O’Hallaron, p. 474

  29. Locality. We say that a computer program exhibits good locality if the program tends to reference data that is nearby or data that has been referenced recently. Because a program might do one of these things, but not the other, the principle of locality is separated into two flavors:
      Temporal locality: a memory location that is referenced once is likely to be referenced multiple times in the near future.
      Spatial locality: if a memory location is referenced once, then nearby locations are likely to be referenced in the near future.
      Bryant/O’Hallaron, p. 478

  30. Examples. In the Sampler function below, RandInt returns a randomly selected integer within the specified interval. Which program has better locality?

      int Sampler(int v[], int N, int K)
      {
          int i, j;
          int sum = 0;

          for (i=0 ; i<K ; i=i+1)
          {
              j = RandInt(0,N-1);
              sum += v[j];
          }
          return sum/K;
      }

      int SumVec(int v[], int N)
      {
          int i;
          int sum = 0;

          for (i=0 ; i<N ; i=i+1)
              sum += v[i];
          return sum;
      }

      Bryant/O’Hallaron, p. 479

  31. Memory Hierarchy. Levels toward the top are smaller, faster, and costlier (per byte) storage devices; levels toward the bottom are larger, slower, and cheaper (per byte) storage devices.
      L0: Registers. CPU registers hold words retrieved from cache memory.
      L1: On-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache.
      L2: Off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from memory.
      L3: Main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
      L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
      L5: Remote secondary storage (distributed file systems, Web servers).
      Bryant/O’Hallaron, p. 483

  32. Caching Principle. The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks. Data is copied between levels in block-sized transfer units. [Figure: level k holds blocks 4, 9, 14, and 3; level k+1 holds blocks 0 to 15.] Bryant/O’Hallaron, p. 484

  33. Cache Misses
      Cold misses, or compulsory misses, occur the first time that a data item is referenced.
      Conflict misses occur when two memory references have to occupy the same cache line. They can occur even when the remainder of the cache is not in use.
      Capacity misses occur when there are no more free lines in the cache.
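A tiny simulation can make the conflict-miss case concrete. The sketch below assumes a placement policy that the slide does not specify: a hypothetical cache with four lines in which block b may live only in line (b mod 4). The function name `count_misses` is also introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_LINES 4

/* Count the misses for a sequence of block numbers in a hypothetical
   cache with NUM_LINES lines, where block b may be stored only in
   line (b mod NUM_LINES). */
size_t count_misses(const int *blocks, size_t n)
{
    int line[NUM_LINES];
    size_t misses = 0;

    assert(blocks != NULL || n == 0);
    for (int k = 0; k < NUM_LINES; k++)
        line[k] = -1;                /* -1 marks an empty line */

    for (size_t i = 0; i < n; i++) {
        int b = blocks[i];
        assert(b >= 0);
        int k = b % NUM_LINES;
        if (line[k] != b) {          /* miss: fetch b, evict the old block */
            misses++;
            line[k] = b;
        }
    }
    return misses;
}
```

Under this policy the reference sequence 0, 8, 0, 8 misses every time, because blocks 0 and 8 compete for line 0 while the other three lines stay empty. The sequence 0, 9, 0, 9 incurs only the two cold misses.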

  34. L1 and L2 Bus System. [Figure: the CPU chip contains the register file, ALU, L1 cache, and bus interface. A cache bus connects the CPU chip to the L2 cache. The system bus connects the bus interface to the I/O bridge, and the memory bus connects the I/O bridge to main memory.] Bryant/O’Hallaron, p. 488

  35. Cache Organization. The cache is an array of S = 2^s sets (Set 0 to Set S-1). Each set holds E lines. Each line has 1 valid bit, t tag bits, and a cache block of B = 2^b bytes (bytes 0 to B-1). Cache size: C = B x E x S data bytes. Bryant/O’Hallaron, p. 488

  36. Address Partition. An m-bit address (bits m-1 down to 0) is divided into three fields, from most significant to least significant:
      Tag (t bits): compared with tags in the cache to find a match.
      Set index (s bits): used to find the set where the data might be found in the cache.
      Block offset (b bits): selects which word, inside the block, is referenced.
      Bryant/O’Hallaron, p. 488
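The address partition can be expressed with shifts and masks. The following is a minimal sketch that assumes m = 32 (a 32-bit address); the `AddrParts` type and the `split_address` function are hypothetical names, while s and b are the set-index and block-offset widths from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of an address, as seen by the cache. */
typedef struct {
    uint32_t tag;
    uint32_t set;
    uint32_t offset;
} AddrParts;

/* Split a 32-bit address for a cache with 2^s sets and 2^b bytes per
   block. The remaining t = 32 - s - b high-order bits form the tag. */
AddrParts split_address(uint32_t addr, unsigned s, unsigned b)
{
    AddrParts p;

    assert(s + b < 32);
    p.offset = addr & ((1u << b) - 1);         /* low b bits */
    p.set    = (addr >> b) & ((1u << s) - 1);  /* next s bits */
    p.tag    = addr >> (s + b);                /* remaining high bits */
    return p;
}
```

For example, with s = 3 and b = 4 the address 0x80004018 has block offset 0x8, set index 1, and tag 0x1000080.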

  37. Multi-Level Cache Organization. [Figure: CPU registers (Regs), L1 d-cache and L1 i-cache, L2 unified cache, main memory, and disk.] Bryant/O’Hallaron, p. 504

  38. Writing Cache-Conscious Programs. Problem: Write C code for a function that computes the sum of the elements of a two-dimensional array, a[M][N], of integers.

      /* C99: the dimensions are passed first so that a[M][N] is a valid parameter type. */
      int SumArrayCols(int M, int N, int a[M][N])
      {
          int i, j;
          int sum = 0;

          for (j=0 ; j<N ; j++)
              for (i=0 ; i<M ; i++)
                  sum += a[i][j];
          return sum;
      }

      int SumArrayRows(int M, int N, int a[M][N])
      {
          int i, j;
          int sum = 0;

          for (i=0 ; i<M ; i++)
              for (j=0 ; j<N ; j++)
                  sum += a[i][j];
          return sum;
      }

      Bryant/O’Hallaron, p. 508

  39. SumArrayRows Data Access Order

      int SumArrayRows(int M, int N, int a[M][N])
      {
          int i, j;
          int sum = 0;

          for (i=0 ; i<M ; i++)
              for (j=0 ; j<N ; j++)
                  sum += a[i][j];
          return sum;
      }

      Memory layout of a (each row holds six 4-byte ints):
      0x8000 4000 a[0][0]   0x8000 4004 a[0][1]   0x8000 4008 a[0][2]
      0x8000 400C a[0][3]   0x8000 4010 a[0][4]   0x8000 4014 a[0][5]
      0x8000 4018 a[1][0]   0x8000 401C a[1][1]   0x8000 4020 a[1][2]
      0x8000 4024 a[1][3]   0x8000 4028 a[1][4]   0x8000 402C a[1][5]
      0x8000 4030 a[2][0]   0x8000 4034 a[2][1]   0x8000 4038 a[2][2]
      0x8000 403C a[2][3]   0x8000 4040 a[2][4]   0x8000 4044 a[2][5]
      0x8000 4048 a[3][0]   0x8000 404C a[3][1]   0x8000 4050 a[3][2]
      0x8000 4054 a[3][3]   0x8000 4058 a[3][4]   •••

      Bryant/O’Hallaron, p. 508

  40-42. [SumArrayRows Data Access Order, repeated from slide 39.]

  43. SumArrayCols Data Access Order

      int SumArrayCols(int M, int N, int a[M][N])
      {
          int i, j;
          int sum = 0;

          for (j=0 ; j<N ; j++)
              for (i=0 ; i<M ; i++)
                  sum += a[i][j];
          return sum;
      }

      Memory layout of a (each row holds six 4-byte ints):
      0x8000 4000 a[0][0]   0x8000 4004 a[0][1]   0x8000 4008 a[0][2]
      0x8000 400C a[0][3]   0x8000 4010 a[0][4]   0x8000 4014 a[0][5]
      0x8000 4018 a[1][0]   0x8000 401C a[1][1]   0x8000 4020 a[1][2]
      0x8000 4024 a[1][3]   0x8000 4028 a[1][4]   0x8000 402C a[1][5]
      0x8000 4030 a[2][0]   0x8000 4034 a[2][1]   0x8000 4038 a[2][2]
      0x8000 403C a[2][3]   0x8000 4040 a[2][4]   0x8000 4044 a[2][5]
      0x8000 4048 a[3][0]   0x8000 404C a[3][1]   0x8000 4050 a[3][2]
      0x8000 4054 a[3][3]   0x8000 4058 a[3][4]   •••

      Bryant/O’Hallaron, p. 508

  44-46. [SumArrayCols Data Access Order, repeated from slide 43.]
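The difference between the two access orders comes down to the distance in memory between consecutive references. The sketch below computes that distance for a row-major array; the helper names `offset_of`, `row_order_stride`, and `col_order_stride` are hypothetical and are introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a[i][j] from the start of a row-major array of ints
   with N columns. */
size_t offset_of(size_t i, size_t j, size_t N)
{
    assert(j < N);
    return (i * N + j) * sizeof(int);
}

/* SumArrayRows visits a[0][0] then a[0][1]: neighbors in memory. */
size_t row_order_stride(size_t N)
{
    assert(N >= 2);
    return offset_of(0, 1, N) - offset_of(0, 0, N);
}

/* SumArrayCols visits a[0][0] then a[1][0]: one full row apart. */
size_t col_order_stride(size_t N)
{
    assert(N >= 1);
    return offset_of(1, 0, N) - offset_of(0, 0, N);
}
```

With six columns and 4-byte ints, as in the layout on the slides, the row order advances 4 bytes per access (0x4000, 0x4004, ...), while the column order advances 24 bytes per access (0x4000, 0x4018, 0x4030, ...). The row order therefore has the better spatial locality.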

  47. Read Bandwidth. The rate at which a program reads data from the memory system is called the read throughput or the read bandwidth. The read throughput is measured in bytes per second, or more commonly in Mbytes/s. The read throughput of a program depends on the memory hierarchy level from which the data is retrieved. To estimate the read throughput, we can write a program that forces the data to come from the various levels in the hierarchy.

  48. Measuring Read Bandwidth

      extern int data[];  /* the array being read; assumed to be defined elsewhere */

      void test(int elems, int stride)
      {
          int i;
          int result = 0;
          volatile int sink;

          for (i=0 ; i<elems ; i += stride)
              result += data[i];
          sink = result;  /* to prevent compiler from optimizing away the loop */
      }

      Bryant/O’Hallaron, p. 508
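A rough way to turn the test loop into a throughput estimate is to time it and divide the bytes read by the elapsed time. The sketch below makes several assumptions that are not in the slides: it times with the standard `clock()` function rather than a cycle counter, it defines its own `data` array of `MAXELEMS` ints, and the names `bytes_read` and `read_throughput` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define MAXELEMS (1 << 20)          /* assumed array size */
static int data[MAXELEMS];          /* zero-initialized array to read */

/* Number of bytes the strided loop reads from data[0 .. elems-1]. */
size_t bytes_read(int elems, int stride)
{
    assert(elems >= 0 && elems <= MAXELEMS && stride > 0);
    return (size_t)((elems + stride - 1) / stride) * sizeof(int);
}

/* Run the strided read loop once and return the estimated read
   throughput in Mbytes/s, or 0 if the timer is too coarse to
   measure the run. */
double read_throughput(int elems, int stride)
{
    volatile int sink;
    int result = 0;

    assert(elems >= 0 && elems <= MAXELEMS && stride > 0);

    clock_t start = clock();
    for (int i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;                  /* keep the loop from being removed */
    clock_t end = clock();
    (void)sink;

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)bytes_read(elems, stride) / seconds / 1e6;
}
```

Varying `elems` changes how much data is touched, and so which level of the hierarchy can hold it, while varying `stride` changes the spatial locality. A single short run is noisy, so a real measurement would repeat the loop and warm up the cache first.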

  49. Pentium III Xeon Memory Mountain. Bryant/O’Hallaron, p. 514

  50. Temporal Locality (stride = 1)
