1 / 32

Introduction to CUDA Programming

Introduction to CUDA Programming. Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of the New P&H Book. Histogram. E.g., Given an image calculate this: Distribution of values. Sequential Algorithm.

orde
Download Presentation

Introduction to CUDA Programming

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to CUDA Programming Histograms and Sparse Array Multiplication Andreas Moshovos Winter 2009 Based on documents from: NVIDIA & Appendix A of the New P&H Book

  2. Histogram • E.g., Given an image calculate this: • Distribution of values

  3. Sequential Algorithm for (int i = 0; i < BIN_COUNT; i++) result[i] = 0; for (int i = 0; i < dataN; i++) result[data[i]]++;

  4. Parallel Strategy • Distribute work across multiple blocks • Divide input data to blocks • Each block process it’s own portion • Multiple threads, one per image pixel? • #pixels / #threads per thread • Produces a partial histogram • Could produce multiple histograms • Merge all partial histograms • Produces the final histogram

  5. Data Structures and Access Patterns result[data[i]]++; • Data[]: • We control: Can be accessed sequentially • Each element accessed only once • Result[]: • Access is data-dependent • Each element may be accessed multiple times • Data[] in global memory • Result[] in shared memory

  6. Sub-Histograms • How many sub-histograms can we fit in shared memory? • Input value range: 0-255, 1 byte • Each histogram needs 256 entries • How many bytes per entry? • That’s data dependent • Let’s assume 32-bits or 4 bytes • 16KB (shared memory) / (256 x 4) (histogram) • 16 sub-histograms at any given point of time • If one per thread then we have less than a warp • Let’s try one histogram per block • many threads per block

  7. Partial Histogram Data Structure • Array in shared memory • One per block • One column per possible pixel value

  8. Algorithm Overview • Step 1: • Initialize partial histogram • Each thread: • s_Hist[index] = 0 • index += threads per block • Step 2: • Update histogram • Each thread: • read data[index] • update s_Hist[] • index += Total number of threads • Step 3: • Update global histogram • Each thread • read s_hist[index] • update global histogram • index += threads per block

  9. Simultaneous Updates? • Threads in a block: • update s_Hist[] • All threads: • update global history • Without special support this becomes: • register X = value of A • X ++ • A = register X • This is a read-modify-write sequence

  10. The problem with simultaneous updates • What if we do each step individually • r10 = mem[100] 10 r100 = mem[100] 10 • r10++ 11 r100++ 11 • mem[100] = r10 11 mem[100] = r100 11 • But we really wanted 12 • What if we had 32 threads running in parallel? • Start with 10 we would want: 10+32 • We may still get 11 • Need to think about this: • Special support: Atomic operations

  11. Atomic Operations • Read-Modify-Write operations that are guaranteed to happen “atomically” • Produces the same result as if the sequence executed in isolation in time • Think of it as “serializing the execution” of all atomics • This is not what happens – This is how you should think about them

  12. Atomic Operations • Supported both in Shared and Global memory • Example: • atomicAdd (pointer, value) • does: *pointer += value • Atomic Operations • Add, Sub, Inc, Dec • Exch, Min, Max, CAS • Bitwise: And, Or, Xor • Work with (unsigned) integers • Exch works with single FP as well

  13. atomicExch, atomicMin, atomicMax, atomicCAS • atomicExch (pointer, value) • tmp = * pointer • *pointer = value • return tmp • atomicMin (pointer, value) (max is similar) • tmp = *pointer • if (*pointer < value) *pointer = value • return tmp • atomicCAS (pointer, value1, value2) • tmp = *pointer • if (*pointer == value1) *pointer = value2 • return tmp

  14. atomicInc, atomicDec • atomicInc (pointer, value) • tmp = *pointer • if (*pointer < value) (*pointer)++ • else *pointer = 0 • return tmp • atomicDec (pointer, value) • tmp = *pointer • if (*pointer == 0 || *pointer > value) *pointer = value • else (*pointer)-- • return tmp

  15. atomicAnd, atomicOr, atomicXOR • atomicAnd (pointer, value) • tmp = *pointer • *pointer = *pointer & value • return tmp • Others similar

  16. CUDA Implementation - Declarations __global__ void histogram256Kernel (uint *d_Result, uint *d_Data, int dataN){ //Current global thread index const int globalTid = blockIdx.x * blockDim.x + threadIdx.x; //Total number of threads in the compute grid const int numThreads = blockDim.x * gridDim.x; __shared__ uint s_Hist[BIN_COUNT];

  17. Clear partial histogram buffer //Clear shared memory buffer for current thread block before processing for ( int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) s_Hist[pos] = 0; __syncthreads ();

  18. Generate partial histogram for ( int pos = globalTid; pos < dataN; pos += numThreads){ uint data4 = d_Data[pos]; atomicAdd (s_Hist + (data4 >> 0) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 8) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 16) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 24) & 0xFFU, 1); } __syncthreads();

  19. Merge partial histogram with global histogram for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){ atomicAdd(d_Result + pos, s_Hist[pos]); }

  20. Code overview __global__ void histogram256Kernel (uint *d_Result, uint *d_Data, int dataN){ const int globalTid = blockIdx.x * blockDim.x + threadIdx.x; const int numThreads = blockDim.x * gridDim.x; __shared__ uint s_Hist[BIN_COUNT]; for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) s_Hist[pos] = 0; __syncthreads (); for (int pos = globalTid; pos < dataN; pos += numThreads){ uint data4 = d_Data[pos]; atomicAdd (s_Hist + (data4 >> 0) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 8) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 16) & 0xFFU, 1); atomicAdd (s_Hist + (data4 >> 24) & 0xFFU, 1); } __syncthreads(); for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x) atomicAdd(d_Result + pos, s_Hist[pos]); }

  21. Discussion • s_Hist updates • Conflicts in shared memory • Data Dependent • 16-way conflicts possible and likely • Is there an alternative? • One histogram per thread? • Load data in shared memory • Each thread produces a portion of the s_Hist that maps onto the same bank?

  22. Warp Vote Functions • int __all (int predicate); • evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for all of them. • int __any (int predicate); • evaluates predicate for all threads of the warp and returns non-zero if and only if predicate evaluates to non-zero for any of them.

  23. Warp Vote Functions Example • Original code: for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){ atomicAdd(d_Result + pos, s_Hist[pos]); • Modified w/ __any (): for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){ if (__any (s_Hist[pos] != 0) ) atomicAdd(d_Result + pos, s_Hist[pos]); • Modified w/ __all (): for (int pos = threadIdx.x; pos < BIN_COUNT; pos += blockDim.x){ if (!__all (s_Hist[pos] == 0) ) atomicAdd(d_Result + pos, s_Hist[pos]);

  24. Sparse Matrix Multiplication • Sparse Matrix N x N: • number of non-zero entries m is only a small fraction of the total • Representation goal: • store only non-zero entries • Typically: • m = O(N)

  25. Compressed Sparse Row Representation • Av[]: • Array values in row-major order • Aj[]: • Column for corresponding Av[] entry • Ap[]: • row i extends from indexes Ap[i] to Ap[i+1] -1 in Av[] and Aj[]

  26. Matrix x Vector: y = Ax x Ax 2 x 20 + 4 x 30 + 1 x 40

  27. Single Row Multiply • Produces an entry of the result vector float multiply_row (uint rowsize, uint *Aj, // column indices for row float *Av, // nonzero entries for row float *xl // the RHS vector { float sum = 0; for (uint column=0; column<rowsize; column++) sum = Av[column] * x[ Aj[column] ] ; return sum; }

  28. Serial Code void csrmul_serial (uint *Ap, uint *Aj, float *Av, uint num_rows, float *x, float *y) { for (uint row=0; row<num_rows; row++) { uint row_begin = Ap[row]; uint row_end = Ap[row + l]; y[row] = multiply_row ( row_end - row_begin, Aj+row_begin, Av+row_begin, x); } }

  29. CUDA Strategy • Assume that there are many rows • One thread per row

  30. CUDA Kernel __global__ void csrmul_kernel (uint *Ap, uint *Aj, float *Av, uint num_rows, float *x, float *y) uint row = blockIdx.x * blockDim.x + threadIdx.x; uint row_begin = Ap[row]; uint row_end = Ap[row+1]; y[row] = multiply_row (row_end – row_begin, Aj + row_begin, Av + row_begin, x); }

  31. Discussion • We are not using shared memory • all are in global memory • Claim: • rows using x[i] will be rows near row I • Block processing rows i through j • cache x[i] through x[j] • load from shared memory if possible • Unroll multiply_row() • Fetch Ap[row+1] from adjacent thread

  32. CSR multiplication using shared memory __global__ void csrmul_cached( uint *Ap, uint *Aj, float *Av, uint num_rows,float *x, float *y) { __shared__ float cache[blocksize]; uint block_begin = blockIdx.x * blockDim.x; uint block_end = block_begin + blockDim.x; uint row = block_begin + threadIdx.x: if (row < num_rows) cache[threadIdx.x] = x[row]; __syncthreads(); if (row<num_rows) { uint row_begin = Ap[row]; uint row_end = Ap[row + l] ; float sum = 0, x_j ; for (uint col=row_begin; col < row_end; col++ ) { uint j = Aj[col]; // Fetch x_j from our cache when possible if (j>=block_begin && j<block_end ) x_j = cache [j – block_begin]; else x_j = x [j]; sum += Av[col] * x_j ; } y[row] = sum; } }

More Related