GPU Architecture and Programming
GPU vs CPU • Video: https://www.youtube.com/watch?v=fKK933KK6Gg
GPU Architecture • GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering. • Starting in the late 1990s, the hardware became increasingly programmable, culminating in 1999 with the GeForce 256, which NVIDIA marketed as the first GPU.
CPU + GPU is a powerful combination • CPUs consist of a few cores optimized for serial processing. • GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. • Serial portions of the code run on the CPU, while parallel portions run on the GPU.
Architecture of GPU • Images copied from http://www.pgroup.com/lit/articles/insider/v2n1a5.htm and http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf
CUDA Programming • CUDA (Compute Unified Device Architecture) is a parallel programming platform created by NVIDIA for its GPUs. • By using CUDA, you can write programs that directly access the GPU. • The CUDA platform is accessible to programmers via CUDA libraries and extensions to programming languages such as C, C++, and Fortran. • C/C++ programmers use “CUDA C/C++”, compiled with the nvcc compiler. • Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler.
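For example, a CUDA C/C++ source file (here a hypothetical hello.cu, not a file named in the slides) is compiled from the command line with:

    nvcc hello.cu -o hello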
Terminology: • Host: The CPU and its memory (host memory) • Device: The GPU and its memory (device memory)
Programming Paradigm • Each parallel function of the application executes as a kernel. Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
Programming Flow • Copy input data from CPU memory to GPU memory • Load the GPU program and execute it • Copy results from GPU memory back to CPU memory (see the sketch below)
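A minimal CUDA C sketch of this three-step flow (the kernel name mykernel and the buffer names are illustrative, not from the original slides):

    #include <cuda_runtime.h>

    // hypothetical kernel; its body is not the point of this sketch
    __global__ void mykernel(const float *in, float *out) { }

    int main(void) {
        const int N = 1024;
        size_t size = N * sizeof(float);
        float h_in[1024], h_out[1024];     // host (CPU) buffers; contents omitted
        float *d_in, *d_out;               // device (GPU) buffers

        cudaMalloc((void **)&d_in, size);
        cudaMalloc((void **)&d_out, size);

        // 1. Copy input data from CPU memory to GPU memory
        cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

        // 2. Load the GPU program and execute it
        mykernel<<<N / 256, 256>>>(d_in, d_out);

        // 3. Copy results from GPU memory to CPU memory
        cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }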
Each parallel function of the application is executed as a kernel • That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins. • Fermi has some support for executing multiple independent kernels simultaneously, but most kernels are large enough to fill the entire machine.
Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf
Hello World! Example • __global__ is a CUDA C/C++ keyword meaning: • mykernel() will be executed on the device • mykernel() will be called from the host. Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
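A sketch of the example, following the cited deck:

    #include <stdio.h>

    __global__ void mykernel(void) {
        // empty kernel body: runs on the device, does nothing in this minimal example
    }

    int main(void) {
        mykernel<<<1,1>>>();        // triple angle brackets mark a call from host code to device code
        printf("Hello World!\n");   // printed by the host
        return 0;
    }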
Addition Example • Since add() runs on the device, the pointers a, b, and c must point to device memory. Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
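A sketch of the full example, following the cited deck (the d_ prefix marks device copies; the input values are illustrative):

    #include <stdio.h>

    __global__ void add(int *a, int *b, int *c) {
        *c = *a + *b;                      // runs on the device, so all three pointers must be device pointers
    }

    int main(void) {
        int a = 2, b = 7, c;               // host copies of the inputs and output
        int *d_a, *d_b, *d_c;              // device copies
        int size = sizeof(int);

        cudaMalloc((void **)&d_a, size);   // allocate device memory
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        // copy inputs to the device
        cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

        add<<<1,1>>>(d_a, d_b, d_c);       // launch with 1 block of 1 thread

        // copy the result back to the host
        cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
        printf("%d + %d = %d\n", a, b, c);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }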
Vector Addition Example • Kernel function: Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
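The code itself is missing from this transcript; in the cited deck, the kernel indexes the arrays by block, with each block computing one element:

    __global__ void add(int *a, int *b, int *c) {
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];   // blockIdx.x selects this block's element
    }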
main: Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
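A sketch of the corresponding main, following the cited deck; it launches N blocks of one thread each:

    #include <stdlib.h>

    #define N 512

    int main(void) {
        int *a, *b, *c;                    // host copies
        int *d_a, *d_b, *d_c;              // device copies
        int size = N * sizeof(int);

        cudaMalloc((void **)&d_a, size);
        cudaMalloc((void **)&d_b, size);
        cudaMalloc((void **)&d_c, size);

        a = (int *)malloc(size);
        b = (int *)malloc(size);
        c = (int *)malloc(size);
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }   // illustrative input values

        cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

        add<<<N,1>>>(d_a, d_b, d_c);       // N blocks, 1 thread per block

        cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

        free(a); free(b); free(c);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        return 0;
    }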
Alternative 1: Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
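The transcript again omits the code; in the cited deck, the first alternative indexes by thread rather than by block, launching a single block of N threads (an assumption based on that deck, not confirmed by this transcript):

    __global__ void add(int *a, int *b, int *c) {
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];   // threadIdx.x selects this thread's element
    }

    // launched from main as: 1 block, N threads
    add<<<1,N>>>(d_a, d_b, d_c);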
Alternative 2: use a combination of blocks and threads, and compute a unique global thread ID from the built-in index variables:

    int globalThreadId = threadIdx.x + blockIdx.x * M;   // M is the number of threads in a block

or, equivalently, using the built-in block size:

    int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x;

Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
So the kernel becomes: Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
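A sketch of the resulting kernel, following the cited deck:

    __global__ void add(int *a, int *b, int *c) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;   // unique global thread ID
        c[index] = a[index] + b[index];
    }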
The main becomes: Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
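Only the launch line changes; main now divides the N elements across blocks of THREADS_PER_BLOCK threads each (a sketch, following the cited deck; this version assumes N is a multiple of THREADS_PER_BLOCK):

    #define THREADS_PER_BLOCK 512

    add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);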
Handling Arbitrary Vector Sizes • Copied from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
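When the vector length n is not a multiple of the block size, the deck's fix is to round the block count up and guard the array access inside the kernel:

    __global__ void add(int *a, int *b, int *c, int n) {
        int index = threadIdx.x + blockIdx.x * blockDim.x;
        if (index < n)                     // the last block may have threads past the end of the arrays
            c[index] = a[index] + b[index];
    }

    // launch enough blocks to cover n elements, rounding up (M = threads per block)
    add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);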