CS 179: Lecture 3
A bold problem! • Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by: P(r) = \sum_{k=0}^{n-1} c_k r^k • We want, for r_0, …, r_{N-1}, the sum: \sum_{i=0}^{N-1} P(r_i)
A kernel !! • A first attempt: each thread evaluates P(r_i) and adds it to *output with a plain, non-atomic add. That is a race condition (hence the "!!"); a more correct kernel accumulates with atomicAdd() instead. A sketch of both versions follows.
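A minimal sketch of both kernels (the names polynomialSum, coeffs, roots, and output are assumptions, not the original slide code):

//Sketch (assumed names): one thread per input r_i
__global__ void polynomialSumNaive(const float *coeffs, int n,
                                   const float *roots, int N, float *output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float value = 0.0f;
        for (int k = n - 1; k >= 0; k--) //Horner's rule: P(r) = c_0 + r(c_1 + r(...))
            value = value * roots[i] + coeffs[k];
        *output += value; //!! race condition: the read-modify-write is not atomic
    }
}

//A more correct kernel: same evaluation, safe accumulation
__global__ void polynomialSumAtomic(const float *coeffs, int n,
                                    const float *roots, int N, float *output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float value = 0.0f;
        for (int k = n - 1; k >= 0; k--)
            value = value * roots[i] + coeffs[k];
        atomicAdd(output, value); //correct, but see the next slide
    }
}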
Performance problem! • Serialization of atomicAdd() • Every thread adds to the same address, and atomic operations on a given address must execute one at a time, so the accumulation becomes effectively serial.
Accumulation methods – A visual • Purely atomic: every thread atomicAdd()s its value directly to *output • "Linear accumulation": each block combines its threads' values first, and only one value per block is atomicAdd()ed to *output • (The slide's diagram of the two patterns is not reproduced here.)
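A minimal sketch of the linear-accumulation version (names assumed; partial_outputs follows the shared-memory convention the later slides use). Each thread parks its value in shared memory, thread 0 sums the block's values in a loop, and only one atomicAdd() per block touches *output:

//Sketch (assumed names); launch with blockDim.x * sizeof(float) bytes of shared memory
__global__ void polynomialSumLinear(const float *coeffs, int n,
                                    const float *roots, int N, float *output) {
    extern __shared__ float partial_outputs[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float value = 0.0f;
    if (i < N)
        for (int k = n - 1; k >= 0; k--)
            value = value * roots[i] + coeffs[k];
    partial_outputs[threadIdx.x] = value;
    __syncthreads();
    if (threadIdx.x == 0) { //thread 0 accumulates the block's sum linearly
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += partial_outputs[j];
        atomicAdd(output, sum); //one atomic per block instead of one per thread
    }
}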
GPU: Internals • Multiprocessor (blocks go here) • SIMD processing unit (warps go here) • (Diagram of the hardware hierarchy not reproduced here.)
Warp • 32 threads that execute simultaneously! • For the A[] + B[] -> C[] problem, the warp grouping didn't matter: • Warp of threads 0-31: A[idx] + B[idx] -> C[idx] • Warp of threads 32-63: A[idx] + B[idx] -> C[idx] • Warp of threads 64-95: A[idx] + B[idx] -> C[idx] • … • Every warp executes the same instructions.
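For reference, here is a minimal sketch of that kernel (the names vectorAdd, A, B, C, and N are assumptions). Every thread runs the same instruction sequence, so it makes no difference how threads are grouped into warps:

//Sketch (assumed names): the A[] + B[] -> C[] kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx]; //same instructions for every warp
}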
Divergence – Example

//Suppose we have a pointer to some floating-point
//value *val...
if (threadIdx.x % 2 == 0)
    *val += 2; //Branch A
else
    *val /= 3; //Branch B

• Branches result in different instructions!
Divergence • What happens: • The warp executes normally until the if-statement • Branches to calculate Branch A (the "blue" threads: those with even indices), while the others idle • Goes back (!) and branches to calculate Branch B (the "red" threads: those with odd indices)
Calculating polynomial values – does it matter? • Warp of threads 0-31: • (calculate some values) • Warp of threads 32-63: • (calculate some values) • Warp of threads 64-95: • (calculate some values) • Same instructions! Doesn’t matter!
Linear reduction – does it matter? • (after calculating values…) • Warp of threads 0-31: • Thread 0: Accumulate sum • Threads 1-31: Do nothing • Warp of threads 32-63: • Do nothing • Warp of threads 64-95: • Do nothing • Doesn't really matter… in this case: the idle threads have nothing else they could be doing, since a linear sum is inherently serial.
Improving our reduction • More threads participating in the process! • “Binary tree”
Improving our reduction

//Let our shared memory block be partial_outputs[]...
__syncthreads(); //synchronize threads before starting
for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
    //accumulate on threads whose index is a multiple of (offset * 2),
    //without reading past the end of the block
    if (threadIdx.x % (offset * 2) == 0 && threadIdx.x + offset < blockDim.x)
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    __syncthreads();
}
//get thread 0 to atomicAdd() partial_outputs[0] to output
if (threadIdx.x == 0)
    atomicAdd(output, partial_outputs[0]);
Improving our reduction • What will happen? • Each warp will do meaningful execution on ~16 threads • You use lots more warps than you have to! • A “divergent tree” • How to improve?
"Non-divergent tree"

//Let our shared memory block be partial_outputs[]...
__syncthreads();
//set offset to the highest power of 2 that's less than the block dimension
int offset = 1;
while (offset * 2 < blockDim.x)
    offset *= 2;
//For the first iteration, the range check below ensures we don't access
//out-of-range memory (blockDim.x need not be a power of 2)
while (offset >= 1) {
    if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    offset /= 2;
    __syncthreads();
}
//get thread 0 to atomicAdd() partial_outputs[0] to output
if (threadIdx.x == 0)
    atomicAdd(output, partial_outputs[0]);
“Non-divergent tree” • Suppose we have our block of 512 threads… • Warp of threads 0-31: • Accumulate result • … • Warp of threads 224-255: • Accumulate result • Warp of threads 256-287: • Do nothing • … • Warp of threads 480-511: • Do nothing
"Non-divergent tree" • Suppose we're now down to 32 threads… • Warp of threads 0-31: • Threads 0-15: Accumulate result • Threads 16-31: Do nothing • Much less divergence • Divergence only occurs in these final iterations, once all the active threads fit in a single warp (if ever)!
Reduction – Four Approaches • Atomic only • Divergent tree • Linear • Non-divergent tree • (Side-by-side diagrams of the four accumulation patterns into *output not reproduced here.)
Notes on files? (An aside) • Labs and CUDA programs typically involve the following files: • ____.cc • Allows C++ code • g++ compiles this • ____.cu • Allows CUDA syntax • nvcc compiles this • ____.cuh • CUDA header file (declares accessible functions) • (A typical compile-and-link sketch follows.)
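A rough sketch of how these pieces fit together (the file names here are hypothetical, and exact flags and library paths depend on your CUDA install):

nvcc -c lab1.cu -o lab1_cuda.o            # nvcc compiles the CUDA source
g++ -c lab1.cc -o lab1.o                  # g++ compiles the plain C++
g++ lab1.o lab1_cuda.o -lcudart -o lab1   # link against the CUDA runtime
# (you may need something like -L/usr/local/cuda/lib64 for the linker to find -lcudart)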
Big Picture (so far) • CUDA allows large speedups! • Parallelism with CUDA: Different from CPU parallelism • Threads are “small” • Memory to think about • Increase speedup by thinking about problem constraints! • Reduction • (and more!)
Big Picture (so far) • Steps in these two problems are widely applicable! • Dot product • Norm calculation • (and more) • (A dot product, for instance, is the same pattern; see the sketch below.)
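A dot product is just an elementwise multiply followed by exactly the reduction above. A minimal sketch under that assumption (the names dotProduct, A, B, N, and output are not from the lecture):

//Sketch (assumed names): dot product = elementwise multiply + reduction
//Launch with blockDim.x * sizeof(float) bytes of shared memory
__global__ void dotProduct(const float *A, const float *B, int N, float *output) {
    extern __shared__ float partial_outputs[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial_outputs[threadIdx.x] = (i < N) ? A[i] * B[i] : 0.0f;
    __syncthreads();
    //non-divergent tree reduction, as before
    int offset = 1;
    while (offset * 2 < blockDim.x)
        offset *= 2;
    for (; offset >= 1; offset /= 2) {
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);
}

(A norm calculation is the same kernel with A[i] * A[i] as the product, plus a square root on the host at the end.)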
Other notes • Office hours: • Monday: 8-10 PM • Tuesday: 8-10 PM • Lab 1 due Wednesday, 5 PM