Training Program on GPU Programming with CUDA 31st July, 7th Aug, 14th Aug 2011 CUDA Teaching Center @ UoM
Day 1, Session 2: CUDA Programming Model, CUDA Threads Sanath Jayasena, CUDA Teaching Center @ UoM
Outline for Day 1 Session 2 CUDA Programming Model, CUDA Threads • Data Parallelism • CUDA Program Structure • Memory Model & Data Transfer (Brief) • Kernel Functions & Threading (Discussion with Example: Matrix Multiplication)
Data Parallelism • A property of a problem/program • Many arithmetic operations can safely be performed on the data structures simultaneously • Example: matrix multiplication (next slide) • CUDA devices exploit data parallelism to accelerate the execution of applications
Example: Matrix Multiplication • P = M · N • Each element of P is computed as the dot product between a row of M and a column of N • All elements of P can be computed independently and simultaneously [Figure: square matrices M, N and P, each of dimension width × width]
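For reference, a minimal sequential C sketch of the computation the CUDA version will parallelize (the function name and row-major layout are illustrative, not from the slides). Every (row, col) element is independent, which is exactly the parallelism the device exploits:

    // Sequential matrix multiplication: P = M * N, all square, width x width,
    // stored in row-major order as 1-D arrays.
    void MatrixMulSeq(const float *M, const float *N, float *P, int width)
    {
        for (int row = 0; row < width; ++row)
            for (int col = 0; col < width; ++col) {
                float sum = 0.0f;   // dot product of row of M and column of N
                for (int k = 0; k < width; ++k)
                    sum += M[row * width + k] * N[k * width + col];
                P[row * width + col] = sum;
            }
    }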
CUDA Program Structure • A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code • Phases with little or no data parallelism → host code (ANSI C, compiled with a standard compiler) • Phases with significant data parallelism → device code (ANSI C extended with keywords to specify kernels and data structures) • The NVIDIA C Compiler (nvcc) separates the two, compiling host code with the standard tool chain and device code for the GPU
Execution of a CUDA Program [Figure: serial host execution interleaved with kernel launches, each launch creating a grid of device threads]
Execution of a CUDA Program • Execution starts on the host (CPU) • When a kernel is invoked, execution moves to the device (GPU) • A large number of threads is generated • Grid: the collection of all threads generated by a kernel invocation • (The previous slide's figure shows two grids of threads) • Once all threads in a grid complete execution, the grid terminates and execution continues on the host
Example: Matrix Multiplication A simple CUDA host code skeleton for matrix multiplication:

    int main(void)
    {
        // 1. Allocate and initialize matrices M, N, P
        //    I/O to read the input matrices M and N
        ...
        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, width);
        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P
        ...
        return 0;
    }
CUDA Device Memory Model • Host and devices have separate memory spaces • E.g., hardware cards with their own DRAM • To execute a kernel on a device: • Allocate memory on the device • Transfer data: host memory → device memory • After device execution: • Transfer results: device memory → host memory • Free device memory that is no longer needed
CUDA Device Memory Model [Figure: separate host and device memory spaces, with data transfers between them]
CUDA API: Memory Management
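The two runtime calls this slide introduces, with their standard CUDA Runtime API signatures:

    cudaError_t cudaMalloc(void **devPtr, size_t size);  // allocate device global memory
    cudaError_t cudaFree(void *devPtr);                  // release device global memory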
CUDA API: Memory Management • Example:

    float *Md;
    int size = Width * Width * sizeof(float);   // bytes for a Width x Width matrix
    cudaMalloc((void**)&Md, size);              // allocate on the device
    ...
    cudaFree(Md);                               // release when no longer needed
CUDA API: Data Transfer
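The transfer function this slide covers is cudaMemcpy; a minimal sketch of its standard signature and the two directions used in this example (the buffer names Md, M, Pd, P and size follow the surrounding slides):

    // cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
    //                        enum cudaMemcpyKind kind);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // device -> host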
Example: Matrix Multiplication [Code: the host function MatrixMulOnDevice, which allocates device memory, transfers data, launches the kernel and copies the result back]
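A minimal sketch of what MatrixMulOnDevice might look like, following the memory-model steps above; the device-pointer names Md, Nd, Pd mirror the later slides, and the body is an assumption rather than the slide's exact code:

    void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // 1. Allocate device memory and copy the inputs over
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Pd, size);

        // 2. Launch the kernel (execution configuration shown on a later slide;
        //    one block is valid only while Width * Width <= 1024)
        dim3 dimBlock(Width, Width, 1);
        dim3 dimGrid(1, 1, 1);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

        // 3. Copy the result back and free device memory
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }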
Kernel Functions & Threading • A kernel function specifies the code to be executed by all threads of a parallel phase • All threads of a parallel phase execute the same code → single-program, multiple-data (SPMD), a popular programming style for parallel computing • A mechanism is needed to: • Allow threads to distinguish themselves from one another • Direct each thread to the specific part of the data it is supposed to work on
Kernel Functions & Threading • Keywords threadIdx.x and threadIdx.y • The thread indices of a thread • Allow a thread to identify itself at runtime (by accessing hardware registers associated with it) • A thread can be referred to as Thread(threadIdx.x, threadIdx.y) • Thread indices reflect a multi-dimensional organization of threads
Example: Matrix Multiplication Kernel (kernel code below; see the next slide for more details on how each thread accesses the data relevant to it)
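A minimal sketch of the single-block kernel the slides describe, with one thread per element of Pd; the exact listing is reconstructed from the surrounding description, not copied from the slide:

    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        int tx = threadIdx.x;   // column of Pd this thread computes
        int ty = threadIdx.y;   // row of Pd this thread computes

        // Dot product of row ty of Md and column tx of Nd
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

        Pd[ty * Width + tx] = Pvalue;   // one result element per thread
    }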
Thread Indices & Accessing Data Relevant to a Thread [Figure: Md, Nd and Pd laid out in memory as 1-D arrays in row-major order (row 0, row 1, …), with element (ty, tx) of Pd at offset ty * width + tx] • Each thread uses tx, ty to identify the relevant row of Md, the column of Nd and the element of Pd in the for loop • E.g., Thread(2,3) performs the dot product between row 2 of Md and column 3 of Nd and writes the result into element (2,3) of Pd
Threading & Grids • When a kernel is invoked/launched, it is executed as a grid of parallel threads • A CUDA thread grid can comprise millions of lightweight GPU threads per kernel invocation • To fully utilize the hardware, many threads are required → large data parallelism is required • Threads in a grid have a two-level hierarchy: • A grid consists of 1 or more thread blocks • All blocks in a grid have the same number of threads
CUDA Thread Organization [Figure: a grid composed of thread blocks, each block a multi-dimensional array of threads]
Threading with Grids & Blocks • Each thread block has a unique 2-D coordinate given by the CUDA keywords blockIdx.x and blockIdx.y • All blocks must have the same structure and number of threads • Each block is a 3-D array of threads, with a maximum of 1024 threads in total • Coordinates of threads within a block are given by the indices threadIdx.x, threadIdx.y, threadIdx.z • (Not all applications will use all 3 dimensions)
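Combining the block coordinate with the thread coordinate gives each thread a unique global position; a minimal sketch with an illustrative kernel (ScaleKernel and its arguments are hypothetical, not from the slides):

    __global__ void ScaleKernel(float *data, int n)
    {
        // Global 1-D index: which element of data this thread owns
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard: the last block may be partially full
            data[i] *= 2.0f;
    }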
Our Example: Matrix Multiplication • Consider the kernel shown a few slides earlier • It can use only one thread block • The block is organized as a 2-D array of threads • The code can compute a product matrix Pd of only up to 1024 elements, as a block can have at most 1024 threads • Each thread computes one element of Pd • Is this sufficient / acceptable?
Our Example: Matrix Multiplication • When the host code invokes the kernel, the grid and block dimensions are set by passing them as parameters • Example:

    // Set up the execution configuration
    dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
    dim3 dimGrid(1, 1, 1);      // a single block; last 1 ignored (grids are 2-D)

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);
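Answering the question raised two slides back: a single block cannot cover matrices larger than 32 × 32. A hedged sketch of a multi-block variant (TILE_WIDTH, the kernel name and the bounds check are assumptions, not slide content):

    #define TILE_WIDTH 16

    __global__ void MatrixMulKernelTiled(float *Md, float *Nd, float *Pd, int Width)
    {
        // Block and thread indices combine to give a global (row, col) in Pd
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        if (row < Width && col < Width) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += Md[row * Width + k] * Nd[k * Width + col];
            Pd[row * Width + col] = Pvalue;
        }
    }

    // Launch with enough blocks to cover the whole matrix:
    // dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    // dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
    //              (Width + TILE_WIDTH - 1) / TILE_WIDTH, 1);
    // MatrixMulKernelTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);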
Here is an Exercise… • Implement matrix multiplication • Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations • Fill a table like the following (layout is illustrative):

    Matrix size  | (a) CPU time | (b) GPU time | (c) GPU time, other grid/block configs
    256 x 256    |              |              |
    1024 x 1024  |              |              |
    ...          |              |              |
Conclusion • We discussed the CUDA programming model and CUDA thread basics: • Data Parallelism • CUDA Program Structure • Memory Model & Data Transfer (briefly) • Kernel Functions & Threading • (Discussion with Example: Matrix Multiplication)
References for this Session • Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010 • Chapters 4–5 of: J. Sanders and E. Kandrot, CUDA by Example, Addison-Wesley, 2010 • Chapter 2 of: NVIDIA CUDA C Programming Guide, v. 3.2/4.0, NVIDIA Corp., 2010–2011