1 / 38

Cache Memory

Cache Memory. Outline. Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7. 6.6 Putting it Together: The Impact of Caches on Program Performance 6.6.1 The Memory Mountain. The Memory Mountain P512. Read throughput (read bandwidth)

ulema
Download Presentation

Cache Memory

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Cache Memory

  2. Outline • Cache mountain • Matrix multiplication • Suggested Reading: 6.6, 6.7

  3. 6.6 Putting it Together: The Impact of Caches on Program Performance 6.6.1 The Memory Mountain

  4. The Memory Mountain P512 • Read throughput (read bandwidth) • The rate that a program reads data from the memory system • Memory mountain • A two-dimensional function of read bandwidth versus temporal and spatial locality • Characterizes the capabilities of the memory system for each computer

  5. Memory mountain main routineFigure 6.41 P513 /* mountain.c - Generate the memory mountain. */ #define MINBYTES (1 << 10) /* Working set size ranges from 1 KB */ #define MAXBYTES (1 << 23) /* ... up to 8 MB */ #define MAXSTRIDE 16 /* Strides range from 1 to 16 */ #define MAXELEMS MAXBYTES/sizeof(int) int data[MAXELEMS]; /* The array we'll be traversing */

  6. Memory mountain main routine int main() { int size; /* Working set size (in bytes) */ int stride; /* Stride (in array elements) */ double Mhz; /* Clock frequency */ init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ Mhz = mhz(0); /* Estimate the clock frequency */

  7. Memory mountain main routine for (size = MAXBYTES; size >= MINBYTES; size >>= 1) { for (stride = 1; stride <= MAXSTRIDE; stride++) printf("%.1f\t", run(size, stride, Mhz)); printf("\n"); } exit(0); }

  8. Memory mountain test functionFigure 6.40 P512 /* The test function */ void test (int elems, int stride) { int i, result = 0; volatile int sink; for (i = 0; i < elems; i += stride) result += data[i]; sink = result; /* So compiler doesn't optimize away the loop */ }

  9. Memory mountain test function /* Run test (elems, stride) and return read throughput (MB/s) */ double run (int size, int stride, double Mhz) { double cycles; int elems = size / sizeof(int); test (elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test (elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */ }

  10. The Memory Mountain • Data • Size • MAXBYTES(8M) bytes or MAXELEMS(2M) words • Partially accessed • Working set: from 8MB to 1KB • Stride: from 1 to 16

  11. The Memory MountainFigure 6.42 P514

  12. Ridges of temporal locality • Slice through the memory mountain with stride=1 • illuminates read throughputs of different caches and memory Ridges: 山脊

  13. Ridges of temporal localityFigure 6.43 P515

  14. A slope of spatial locality • Slice through memory mountain with size=256KB • shows cache block size.

  15. A slope of spatial localityFigure 6.44 P516

  16. 6.6 Putting it Together: The Impact of Caches on Program Performance 6.6.2 Rearranging Loops to Increase Spatial Locality

  17. Matrix Multiplication P517

  18. Matrix Multiplication ImplementationFigure 6.45 (a) P518 /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { c[i][j] = 0.0; for (k=0; k<n; k++) c[i][j] += a[i][k] * b[k][j]; } } O(n3)adds and multiplies Each n2 elements of A and B is read n times

  19. Matrix Multiplication P517 • Assumptions: • Each array is an nn array of double, with size 8 • There is a single cache with a 32-byte block size ( B=32 ) • The array size n is so large that a single matrix row does not fit in the L1 cache • The compiler stores local variables in registers, and thus references to local variables inside loops do not require any load and store instructions.

  20. Matrix MultiplicationFigure 6.45 (a) P518 /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } Variable sum held in register

  21. Column- wise Fixed Matrix multiplication (ijk) • Misses per Inner Loop Iteration: ABC 0.25 1.0 0.0 /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } Inner loop: (*,j) (i,j) (i,*) A B C Row-wise 1) (AB) Figure 6.46 P519

  22. Row-wise Column- wise Fixed Matrix multiplication (jik)Figure 6.45 (b) P518 /* jik */ for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum } } Inner loop: (*,j) (i,j) (i,*) A B C • Misses per Inner Loop Iteration: ABC 0.25 1.0 0.0 1) (AB) Figure 6.46 P519

  23. Row-wise Row-wise Fixed Matrix multiplication (kij)Figure 6.45 (e) P518 /* kij */ for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } } Inner loop: (i,k) (k,*) (i,*) A B C 3) (BC) • Misses per Inner Loop Iteration: ABC 0.0 0.25 0.25 Figure 6.46 P519

  24. Row-wise Row-wise Fixed Matrix multiplication (ikj)Figure 6.45 (f) P518 /* ikj */ for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } } Inner loop: (i,k) (k,*) (i,*) A B C 3) (BC) • Misses per Inner Loop Iteration: ABC 0.0 0.25 0.25 Figure 6.46 P519

  25. Column - wise Column- wise Fixed Matrix multiplication (jki)Figure 6.45 (c) P518 /* jki */ for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } } Inner loop: (*,k) (*,j) (k,j) A B C 2) (AC) • Misses per Inner Loop Iteration: ABC 1.0 0.0 1.0 Figure 6.46 P519

  26. Column- wise Column- wise Fixed Matrix multiplication (kji)Figure 6.45 (d) P518 /* kji */ for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } } Inner loop: (*,k) (*,j) (k,j) A B C 2) (AC) • Misses per Inner Loop Iteration: ABC 1.0 0.0 1.0 Figure 6.46 P519

  27. Pentium matrix multiply performanceFigure 6.47 (d) P519 2) (AC) 3) (BC) 1) (AB)

  28. Pentium matrix multiply performance • Notice that miss rates are helpful but not perfect predictors. • Code scheduling matters, too.

  29. Summary of matrix multiplication • ijk (& jik): • 2 loads, 0 stores • misses/iter = 1.25 • kij (& ikj): • 2 loads, 1 store • misses/iter = 0.5 • jki (& kji): • 2 loads, 1 store • misses/iter = 2.0 for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } } for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } } for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } } 1) (AB) 3) (BC) 2) (AC)

  30. 6.6 Putting it Together: The Impact of Caches on Program Performance 6.6.3 Using Blocking to Increase Temporal Locality

  31. Improving temporal locality by blocking P520 • Example: Blocked matrix multiplication • “block” (in this context) does not mean “cache block”. • Instead, it mean a sub-block within the matrix. • Example: N = 8; sub-block size = 4

  32. A11 A12 A21 A22 B11 B12 B21 B22 C11 C12 C21 C22 = X Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars. C11 = A11B11 + A12B21 C12 = A11B12 + A12B22 C21 = A21B11 + A22B21 C22 = A21B12 + A22B22 Improving temporal locality by blocking

  33. Blocked matrix multiply (bijk)Figure 6.48 P521 for (jj=0; jj<n; jj+=bsize) { for (i=0; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] = 0.0; for (kk=0; kk<n; kk+=bsize) { for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } } } }

  34. Blocked matrix multiply analysis • Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into 1 X bsize sliver of C • Loop over i steps through n row slivers of A & C, using same B Sliver: 长条

  35. for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } Innermost Loop Pair kk jj jj kk i i A B C Update successive elements of sliver row sliver accessed bsize times block reused n times in succession Blocked matrix multiply analysis Figure 6.49 P522

  36. Pentium blocked matrix multiply performanceFigure 6.50 P523 2) 3) 1)

  37. 6.7 Putting it Together: Exploring Locality in Your Programs

  38. Techniques P523 • Focus your attention on the inner loops • Try to maximize the spatial locality in your programs by reading data objects sequentially, in the order they are stored in memory • Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory • Miss rates, the number of memory accesses

More Related