CS 179: Lecture 3
A bold problem! • Suppose we have a polynomial P(r) with coefficients c_0, …, c_{n-1}, given by: P(r) = \sum_{k=0}^{n-1} c_k r^k • We want, for r_0, …, r_{N-1}, the sum: \sum_{i=0}^{N-1} P(r_i)
A kernel !! • A first attempt: each thread evaluates P(r_i) and adds it to *output with a plain, non-atomic add. That is a race condition (hence the "!!"); a more correct kernel accumulates with atomicAdd() instead. A sketch of both versions follows.
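A minimal sketch of both kernels (the names polynomialSum, coeffs, roots, and output are assumptions, not the original slide code):

//Sketch (assumed names): one thread per input r_i
__global__ void polynomialSumNaive(const float *coeffs, int n,
                                   const float *roots, int N, float *output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float value = 0.0f;
        for (int k = n - 1; k >= 0; k--) //Horner's rule: P(r) = c_0 + r(c_1 + r(...))
            value = value * roots[i] + coeffs[k];
        *output += value; //!! race condition: the read-modify-write is not atomic
    }
}

//A more correct kernel: same evaluation, safe accumulation
__global__ void polynomialSumAtomic(const float *coeffs, int n,
                                    const float *roots, int N, float *output) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float value = 0.0f;
        for (int k = n - 1; k >= 0; k--)
            value = value * roots[i] + coeffs[k];
        atomicAdd(output, value); //correct, but see the next slide
    }
}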
Performance problem! • Serialization of atomicAdd() • Every thread adds to the same address, and atomic operations on a given address must execute one at a time, so the accumulation becomes effectively serial.
Accumulation methods – A visual • Purely atomic: every thread atomicAdd()s its value directly to *output • "Linear accumulation": each block combines its threads' values first, and only one value per block is atomicAdd()ed to *output • (The slide's diagram of the two patterns is not reproduced here.)
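A minimal sketch of the linear-accumulation version (names assumed; partial_outputs follows the shared-memory convention the later slides use). Each thread parks its value in shared memory, thread 0 sums the block's values in a loop, and only one atomicAdd() per block touches *output:

//Sketch (assumed names); launch with blockDim.x * sizeof(float) bytes of shared memory
__global__ void polynomialSumLinear(const float *coeffs, int n,
                                    const float *roots, int N, float *output) {
    extern __shared__ float partial_outputs[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float value = 0.0f;
    if (i < N)
        for (int k = n - 1; k >= 0; k--)
            value = value * roots[i] + coeffs[k];
    partial_outputs[threadIdx.x] = value;
    __syncthreads();
    if (threadIdx.x == 0) { //thread 0 accumulates the block's sum linearly
        float sum = 0.0f;
        for (int j = 0; j < blockDim.x; j++)
            sum += partial_outputs[j];
        atomicAdd(output, sum); //one atomic per block instead of one per thread
    }
}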
GPU: Internals • Multiprocessor (blocks go here) • SIMD processing unit (warps go here) • (Diagram of the hardware hierarchy not reproduced here.)
Warp • 32 threads that execute simultaneously! • For the A[] + B[] -> C[] problem, the warp grouping didn't matter: • Warp of threads 0-31: A[idx] + B[idx] -> C[idx] • Warp of threads 32-63: A[idx] + B[idx] -> C[idx] • Warp of threads 64-95: A[idx] + B[idx] -> C[idx] • … • Every warp executes the same instructions.
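For reference, here is a minimal sketch of that kernel (the names vectorAdd, A, B, C, and N are assumptions). Every thread runs the same instruction sequence, so it makes no difference how threads are grouped into warps:

//Sketch (assumed names): the A[] + B[] -> C[] kernel
__global__ void vectorAdd(const float *A, const float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        C[idx] = A[idx] + B[idx]; //same instructions for every warp
}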
Divergence – Example

//Suppose we have a pointer to some floating-point
//value *val...
if (threadIdx.x % 2 == 0)
    *val += 2; //Branch A
else
    *val /= 3; //Branch B

• Branches result in different instructions!
Divergence • What happens: • The warp executes normally until the if-statement • Branches to calculate Branch A (the "blue" threads: those with even indices), while the others idle • Goes back (!) and branches to calculate Branch B (the "red" threads: those with odd indices)
Calculating polynomial values – does it matter? • Warp of threads 0-31: • (calculate some values) • Warp of threads 32-63: • (calculate some values) • Warp of threads 64-95: • (calculate some values) • Same instructions! Doesn’t matter!
Linear reduction – does it matter? • (after calculating values…) • Warp of threads 0-31: • Thread 0: Accumulate sum • Threads 1-31: Do nothing • Warp of threads 32-63: • Do nothing • Warp of threads 64-95: • Do nothing • Doesn't really matter… in this case: the idle threads have nothing else they could be doing, since a linear sum is inherently serial.
Improving our reduction • More threads participating in the process! • “Binary tree”
Improving our reduction

//Let our shared memory block be partial_outputs[]...
__syncthreads(); //synchronize threads before starting
for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
    //accumulate on threads whose index is a multiple of (offset * 2),
    //without reading past the end of the block
    if (threadIdx.x % (offset * 2) == 0 && threadIdx.x + offset < blockDim.x)
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    __syncthreads();
}
//get thread 0 to atomicAdd() partial_outputs[0] to output
if (threadIdx.x == 0)
    atomicAdd(output, partial_outputs[0]);
Improving our reduction • What will happen? • Each warp will do meaningful execution on ~16 threads • You use lots more warps than you have to! • A “divergent tree” • How to improve?
"Non-divergent tree"

//Let our shared memory block be partial_outputs[]...
__syncthreads();
//set offset to the highest power of 2 that's less than the block dimension
int offset = 1;
while (offset * 2 < blockDim.x)
    offset *= 2;
//For the first iteration, the range check below ensures we don't access
//out-of-range memory (blockDim.x need not be a power of 2)
while (offset >= 1) {
    if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
        partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
    offset /= 2;
    __syncthreads();
}
//get thread 0 to atomicAdd() partial_outputs[0] to output
if (threadIdx.x == 0)
    atomicAdd(output, partial_outputs[0]);
“Non-divergent tree” • Suppose we have our block of 512 threads… • Warp of threads 0-31: • Accumulate result • … • Warp of threads 224-255: • Accumulate result • Warp of threads 256-287: • Do nothing • … • Warp of threads 480-511: • Do nothing
"Non-divergent tree" • Suppose we're now down to 32 threads… • Warp of threads 0-31: • Threads 0-15: Accumulate result • Threads 16-31: Do nothing • Much less divergence • Divergence only occurs in these final iterations, once all the active threads fit in a single warp (if ever)!
Reduction – Four Approaches • Atomic only • Divergent tree • Linear • Non-divergent tree • (Side-by-side diagrams of the four accumulation patterns into *output not reproduced here.)
Notes on files? (An aside) • Labs and CUDA programs typically involve the following files: • ____.cc • Allows C++ code • g++ compiles this • ____.cu • Allows CUDA syntax • nvcc compiles this • ____.cuh • CUDA header file (declares accessible functions) • (A typical compile-and-link sketch follows.)
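A rough sketch of how these pieces fit together (the file names here are hypothetical, and exact flags and library paths depend on your CUDA install):

nvcc -c lab1.cu -o lab1_cuda.o            # nvcc compiles the CUDA source
g++ -c lab1.cc -o lab1.o                  # g++ compiles the plain C++
g++ lab1.o lab1_cuda.o -lcudart -o lab1   # link against the CUDA runtime
# (you may need something like -L/usr/local/cuda/lib64 for the linker to find -lcudart)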
Big Picture (so far) • CUDA allows large speedups! • Parallelism with CUDA: Different from CPU parallelism • Threads are “small” • Memory to think about • Increase speedup by thinking about problem constraints! • Reduction • (and more!)
Big Picture (so far) • Steps in these two problems are widely applicable! • Dot product • Norm calculation • (and more) • (A dot product, for instance, is the same pattern; see the sketch below.)
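A dot product is just an elementwise multiply followed by exactly the reduction above. A minimal sketch under that assumption (the names dotProduct, A, B, N, and output are not from the lecture):

//Sketch (assumed names): dot product = elementwise multiply + reduction
//Launch with blockDim.x * sizeof(float) bytes of shared memory
__global__ void dotProduct(const float *A, const float *B, int N, float *output) {
    extern __shared__ float partial_outputs[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial_outputs[threadIdx.x] = (i < N) ? A[i] * B[i] : 0.0f;
    __syncthreads();
    //non-divergent tree reduction, as before
    int offset = 1;
    while (offset * 2 < blockDim.x)
        offset *= 2;
    for (; offset >= 1; offset /= 2) {
        if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);
}

(A norm calculation is the same kernel with A[i] * A[i] as the product, plus a square root on the host at the end.)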
Other notes • Office hours: • Monday: 8-10 PM • Tuesday: 8-10 PM • Lab 1 due Wednesday, 5 PM