
Using CUDA for High Performance Scientific Computing



  1. Using CUDA for High Performance Scientific Computing • Dana Schaa • NUCAR Research Group • Northeastern University

  2. Outline • What is CUDA? • Concepts and Terminology • Program Design with CUDA • Example Programs • Ideal Characteristics for Graphics Processing

  3. CUDA and nVidia • CUDA = Compute Unified Device Architecture • CUDA is a programming interface • Portions of the code are targeted to run on an nVidia GPU • Provides an extension to C • Only a subset of the C standard library functions is supported in device code • nvcc - the nVidia CUDA compiler
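
The "extension to C" compiled by nvcc can be shown with a minimal sketch (the kernel name `add_one` and the array contents are illustrative, not from the slides; compile with `nvcc example.cu`):

```cuda
#include <stdio.h>

// __global__ is one of CUDA's extensions to C: this function is a kernel
// that runs on the device, launched from host code.
__global__ void add_one(float *data)
{
    int i = threadIdx.x;        // built-in variable identifying this thread
    data[i] = data[i] + 1.0f;
}

int main(void)
{
    float host[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float *dev;

    cudaMalloc((void **)&dev, sizeof(host));                      // allocate device memory
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);  // host -> device

    add_one<<<1, 4>>>(dev);  // <<<blocks, threads>>> is another CUDA extension

    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(dev);

    for (int i = 0; i < 4; ++i)
        printf("%g\n", host[i]);
    return 0;
}
```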

  4. GeForce 8800 Architecture [diagram: an array of multiprocessors (mp), each containing several processors (uP) and 16KB of shared memory, connected to 768MB of device memory holding the local, global, and other memory spaces]

  5. CUDA Terminology • A kernel is a portion of a program that is executed many times (independently) and on different data • The host is the CPU that is executing the code (i.e. the CPU of the system that the nVidia board is plugged in to) • The device is the nVidia board itself

  6. CUDA Terminology (2) • A thread is an instance of a computational kernel • Threads are arranged into SIMD groups, called warps • The warp size for the 8800 Series is 32 • A block is a group of threads • One block at a time is assigned to a multiprocessor • A grid is a group of blocks • All of the threads in a grid perform the same task • Blocks within a grid cannot run different operations
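
The grid/block/warp hierarchy shows up directly in the launch configuration. A sketch (the kernel name `process` and the assumption that `n` is a multiple of 256 are illustrative):

```cuda
__global__ void process(float *x)
{
    // Global index: block offset plus this thread's position in its block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = 2.0f * x[i];
}

void launch(float *d_x, int n)      // assumes n is a multiple of 256
{
    dim3 block(256);                // 256 threads per block = 8 warps of 32
    dim3 grid(n / block.x);         // enough blocks to cover all n elements
    process<<<grid, block>>>(d_x);  // every block in the grid runs the same kernel
}
```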

  7. CUDA Terminology (3) [diagram: a grid composed of blocks, each block composed of several warps (W)]

  8. Program Design - Threads • More threads per block are better for time slicing • Minimum: 64, Ideal: 192-256 • More threads per block means fewer registers per thread • Kernel invocation may fail if the kernel compiles to more registers than are available • Threads within a block can be synchronized • Important for SIMD efficiency • A grid can contain at most 64K blocks per grid dimension
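
Within-block synchronization is done with `__syncthreads()`. A sketch (the kernel name and the 256-thread block size, from the ideal range above, are illustrative):

```cuda
#define THREADS_PER_BLOCK 256   // within the suggested 192-256 range

// Reverses each block-sized chunk of data in place.
__global__ void reverse_block(float *data)
{
    __shared__ float buf[THREADS_PER_BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    buf[t] = data[base + t];
    __syncthreads();   // every thread in the block must finish loading
                       // before any thread reads a neighbor's element
    data[base + t] = buf[blockDim.x - 1 - t];
}
```

Note that `__syncthreads()` only synchronizes threads within one block; there is no equivalent barrier across blocks (see the next slide).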

  9. Program Design - Blocks • There should be at least as many blocks as multiprocessors • The number of blocks should be at least 100 to scale to future generations • Blocks within a grid cannot be synchronized • Multiple blocks can share a multiprocessor only by partitioning its registers and shared memory among them

  10. Program Design: On-Chip Memory • Registers • 8192 registers per multiprocessor (about 32 per thread with 256 threads) • Shared Memory - per block • 16KB per multiprocessor • Data should be in 32-bit increments to take advantage of concurrent accesses • Access is as fast as register access if there are no bank conflicts

  11. [diagram: three shared-memory access patterns, labeled a, b, and c] • a - no conflict • b - no conflict • c - conflict, must be serialized
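
The conflict-free and conflicting cases in the diagram correspond to different access strides into shared memory. A sketch (kernel name and sizes are illustrative; the 8800 series has 16 banks, with 32-bit word i in bank i mod 16):

```cuda
#define N_THREADS 128

__global__ void bank_demo(float *out)
{
    __shared__ float s[N_THREADS * 3];
    int t = threadIdx.x;

    // Fill the whole shared array cooperatively.
    for (int i = t; i < N_THREADS * 3; i += blockDim.x)
        s[i] = (float)i;
    __syncthreads();

    float a = s[t];      // stride 1: 16 consecutive threads hit 16 different banks
    float b = s[3 * t];  // odd stride: also maps each thread to a distinct bank
    float c = s[2 * t];  // stride 2: threads t and t+8 share a bank -> serialized
    out[t] = a + b + c;
}
```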

  12. Program Design: Off-Chip Memory • Local (per thread) and Global (per grid) Memories • 200-300 cycle latency • Global memory is accessible from the host • Local memory is used for data and variables that can't fit in registers • 768MB (includes constant and texture memories) • 64-128 bit accesses • Are the different memories variable in size?

  13. Program Design: Off-Chip Memory (2) • Host-Device • Transfers are much slower than intra-device transfers • Small transfers should be batched • Intermediate data should be kept on device
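
Batching small transfers can be sketched as follows (the function and variable names are hypothetical; the point is one large `cudaMemcpy` instead of many small ones):

```cuda
#include <stdlib.h>

// Upload n scattered single-float records to a device buffer.
void upload(float *d_buf, float **h_items, int n)
{
    // Slow alternative: one PCIe transaction per element.
    // for (int i = 0; i < n; ++i)
    //     cudaMemcpy(d_buf + i, h_items[i], sizeof(float), cudaMemcpyHostToDevice);

    // Better: pack into a host staging buffer, then one batched transfer.
    float *staging = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        staging[i] = *h_items[i];
    cudaMemcpy(d_buf, staging, n * sizeof(float), cudaMemcpyHostToDevice);
    free(staging);
}
```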

  14. Program Design: Memory [diagram: the GeForce 8800 memory hierarchy from slide 4 - multiprocessors with 16KB of shared memory each, backed by 768MB of device memory]

  15. Program Design: Memory (2) • Typical memory access pattern • Copy data from global memory to shared memory • Synchronize threads • Manipulate the data in shared memory • Synchronize threads • Copy data back to global memory from shared memory
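
The five-step pattern above can be sketched as a kernel (the name `smooth` and the averaging computation are illustrative):

```cuda
__global__ void smooth(float *out, const float *in)
{
    __shared__ float tile[256];
    int g = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[g];            // 1. copy global -> shared
    __syncthreads();                       // 2. synchronize threads

    float v = 0.5f * tile[threadIdx.x];    // 3. manipulate data in shared memory
    if (threadIdx.x > 0)
        v += 0.5f * tile[threadIdx.x - 1];
    __syncthreads();                       // 4. synchronize (needed if tile is reused)

    out[g] = v;                            // 5. copy the result back to global memory
}
```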

  16. Program Design: Control Flow • Since the hardware is SIMD, control flow instructions can cause thread execution paths to diverge • Divergent execution paths must be serialized (costly) • if, switch, and while statements should be avoided if threads from the same warp will take different paths • The compiler may remove if statements in favor of predicated instructions to prevent divergence
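
The divergence cost depends on whether the branch condition splits a warp. A sketch of the two cases (kernel names are illustrative):

```cuda
__global__ void divergent(float *x)
{
    int t = threadIdx.x;
    // Branching on t % 2 splits every warp: the hardware runs
    // both paths one after the other, masking off half the threads each time.
    if (t % 2 == 0) x[t] *= 2.0f;
    else            x[t] += 1.0f;
}

__global__ void uniform(float *x)
{
    int t = threadIdx.x;
    // Branching on the warp index (warp size 32) keeps every thread of a
    // given warp on the same path, so no serialization occurs.
    if ((t / 32) % 2 == 0) x[t] *= 2.0f;
    else                   x[t] += 1.0f;
}
```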

  17. MatrixMul example

  18. MatrixMul example (2)
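
The transcript does not preserve the code from these two slides; a tiled matrix-multiply kernel in the spirit of nVidia's MatrixMul sample (a sketch, assuming square N x N matrices with N a multiple of TILE) follows the shared-memory pattern from slide 15:

```cuda
#define TILE 16

// C = A * B for square N x N row-major matrices.
__global__ void matrixMul(float *C, const float *A, const float *B, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int m = 0; m < N / TILE; ++m) {
        // Each thread loads one element of each tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + m * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * N + col];
        __syncthreads();   // whole tile loaded before anyone computes

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // done with this tile before it is overwritten
    }
    C[row * N + col] = sum;
}

// Launch configuration: dim3 block(TILE, TILE); dim3 grid(N / TILE, N / TILE);
```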

  19. Ideal CUDA Programs • High intrinsic parallelism • per-pixel or per-element operations • fft, matrix multiply • most image processing applications • Minimal communication (if any) between threads • limited synchronization • Few control flow statements • High ratio of arithmetic to memory operations
