GPU Computing Techniques Using CUDA
CUDA & the GPU: [GPU block diagram: Host Interface & GigaThread Engine, Raster Engines, GPU Cores, Memory Controller, Shared L1 Cache]
Optimization Techniques • Areas in which performance gains can be achieved: - Memory optimization (L1, L2, global memory, shared memory, etc.) - Increasing parallelism between the GPU and CPU
Improving GPU Performance: • Global Memory Coalescing • Shared Memory & Bank Conflicts • L1 Cache Performance • GPU-CPU Interaction Optimization
Global Memory Coalescing Using CUDA
Global Memory Bandwidth • Modern DRAMs (dynamic random access memories) use a parallel process to increase their rate of data access. Each time a location is accessed, many consecutive locations that include the requested location are also accessed. Once such a pattern is detected, the data from all of these consecutive locations in the global memory can be transferred to the processor at high speed. [PM01]
Global Memory Bandwidth • When all threads in a warp (32 threads) execute a load instruction, the hardware detects whether they access consecutive global memory locations. • A kernel should therefore arrange its data accesses so that each request falls on consecutive DRAM locations and can be coalesced. • The GPU thus lets programmers organize the global memory accesses of threads into a favorable pattern: the same instruction, executed by all threads in a warp, accesses consecutive global memory locations.
Example - Matrix • Coalesced pattern (B) • The threads of warp 0 read element 0 of columns 0 through 31. • The threads of warp 1 read element 1 of columns 0 through 31. • and so on…
Global Memory Coalescing - Matrix • How these matrix elements are placed into global memory: • All elements in a row are placed in consecutive locations (row-major order).
Global Memory Coalescing - Matrix • Favorable matrix data access pattern: • The hardware detects that these accesses are to consecutive locations in the global memory.
Global Memory Coalescing - Matrix • Non-coalesced access pattern (see the code-level contrast sketched below):
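A minimal code-level sketch of the contrast (not from the slides; the kernel and variable names are hypothetical, and m is an N x N row-major matrix in global memory):

__global__ void sumColumnsCoalesced(const float *m, float *out, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per column
    if (col >= N) return;
    float sum = 0.0f;
    for (int row = 0; row < N; ++row)
        sum += m[row * N + col];    // a warp reads 32 consecutive floats of one row: coalesced
    out[col] = sum;
}

__global__ void sumRowsNonCoalesced(const float *m, float *out, int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row >= N) return;
    float sum = 0.0f;
    for (int col = 0; col < N; ++col)
        sum += m[row * N + col];    // a warp reads floats N elements apart: not coalesced
    out[row] = sum;
}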
Shared Memory & Bank Conflicts Using CUDA
Shared Memory • Because it is on-chip, the shared memory space is much faster than the global memory space. • To achieve high bandwidth, shared memory is divided into equally sized memory modules, called banks.
Shared Memory and Banks • Maximum amount of shared memory per multiprocessor: 48 KB • In Hydra, 48 KB × 14 multiprocessors = 672 KB • There are 32 banks, organized so that successive 32-bit words are assigned to successive banks • Each bank has a bandwidth of 32 bits per two clock cycles.
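As a small illustration of that organization (the helper below is hypothetical, not part of the CUDA API), the bank holding a given 32-bit word of shared memory can be computed from its byte offset:

__device__ int bankOf(size_t byteOffset)
{
    // 4-byte words and 32 banks: word i maps to bank i % 32
    return (int)((byteOffset / 4) % 32);
}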
Shared Memory Banks • Banks can be accessed simultaneously. • A memory access request made of n addresses that fall in n distinct memory banks can be serviced simultaneously, yielding an overall bandwidth that is n times the bandwidth of a single bank. • If n requests attempt to access the same memory bank, there is a bank conflict (an n-way bank conflict) → the accesses are serialized by the hardware → throughput decreases.
Strided Shared Memory Accesses • Left: Linear addressing with a stride of one 32-bit word (no bank conflict) • Middle: Linear addressing with a stride of two 32-bit words (2-way bank conflicts) • Right: Linear addressing with a stride of three 32-bit words (no bank conflict)
Access Efficiency • Left: Random access (no conflict) • Middle: Random access with a 5-way bank conflict • Right: Worst-case access pattern (32-way bank conflict)
Broadcast Access • A 32-bit word can be read and broadcast to several threads simultaneously when servicing one memory read request. • No conflicts result, because the broadcast only involves reading data.
Irregular & Colliding Shared Memory Access • Left: Conflict-free access via random permutation • Middle: Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5 • Right: Conflict-free broadcast access (all threads access the same word)
Bank Conflict Example • 8-bit and 16-bit accesses typically generate bank conflicts: __shared__ char shared[32]; char data = shared[BaseIndex + tid]; • shared[0], shared[1], shared[2], and shared[3], for example, belong to the same bank, since four consecutive chars are packed into one 32-bit bank word. [Diagram: shared[0] … shared[31] packed four bytes per 32-bit word]
Bank Conflict Example (continued) • No bank conflicts: char data = shared[BaseIndex + 4 * tid]; • Each thread now reads from a different 32-bit word, so the accesses fall in different banks. [Diagram: threads reading every fourth char, one per 32-bit word]
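Another common way to avoid bank conflicts when a warp must read a shared-memory column (a technique not shown on the slides) is to pad the tile with one extra column; the transpose sketch below assumes a 32×32 thread block and a square matrix whose width is a multiple of 32:

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column: column reads no longer hit a single bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced global read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                     // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];     // column read, conflict-free thanks to the padding
}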
Effects of L1 Cache Manipulation Using CUDA
What Does L1 Do? • When a memory location is requested, the L1 cache is queried first. If the address is not found (a miss), the L2 cache is queried, and so on, until main memory is accessed. • When a memory request is serviced, the data for that address is populated all the way through the caches, from main memory to L1. • The next access to that address should then be serviced more quickly, as long as the data has not been displaced from the L1 cache.
Why Turn Off L1 Cache? • L1 Caches exist to improve memory request performance by increasing request throughput and minimizing request latency. • Why on earth would you want to disable the L1 Cache? • Can We Disable L2 or L3?
Toggling L1 in GPU • The compiler offers an option to disable the L1 cache. • When compiling CUDA code, each mini-program can be compiled with or without the L1 cache enabled. • Benchmark each executable to see whether the code runs faster with or without L1 enabled. • When the main application is compiled, it links against the CUDA mini-executables.
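For reference, one way to toggle this on Fermi-class GPUs is the -dlcm load-caching flag passed to the PTX assembler (option as documented for the CUDA 4.x toolchain; verify against your toolkit, and the file names here are placeholders):

nvcc -Xptxas -dlcm=ca miniprog.cu -o app_l1on     # default: cache global loads in L1 and L2
nvcc -Xptxas -dlcm=cg miniprog.cu -o app_l1off    # cache global loads in L2 only (L1 bypassed)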
CPU-GPU Interaction Optimization Using CUDA
CPU-GPU Interaction • One of the key optimizations for any GP-GPU application. Cause: • PCI Express bandwidth is much lower than GPU memory bandwidth: roughly 1.6 to 8 GB/s vs. about 177 GB/s. • Problems faced: [Diagram: CPU and host memory (~50 GB/s) on the motherboard, GPU and global memory (~175 GB/s) on the graphics card, connected by PCI Express at ~8 GB/s]
Remedy: CPU-GPU Data Transfers • Minimal transfer • Keep intermediate data directly on the GPU • Move code to the GPU when doing so requires less data transfer • Group transfer • One larger transfer rather than multiple smaller transfers (see the sketch below)
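A minimal sketch of the grouped-transfer idea (buffer names and the packing scheme are hypothetical): pack many small host arrays into one staging buffer and issue a single cudaMemcpy instead of one call per chunk.

#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

void copyGrouped(float *d_dst, float **h_chunks, int numChunks, size_t chunkElems)
{
    size_t bytes = numChunks * chunkElems * sizeof(float);
    float *h_packed = (float *)malloc(bytes);
    for (int i = 0; i < numChunks; ++i)                            // pack on the CPU first
        memcpy(h_packed + i * chunkElems, h_chunks[i], chunkElems * sizeof(float));
    cudaMemcpy(d_dst, h_packed, bytes, cudaMemcpyHostToDevice);    // one large PCIe transfer
    free(h_packed);
}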
Shortcomings of the Remedies • Minimal transfer is not applicable to all kinds of GP-GPU applications. • Group transfer does not reduce or hide the CPU-GPU data transfer latency. • Hence there is a need to optimize the data transfers themselves.
Optimizations by CUDA • Pinned (non-pageable) memory optimization • Decreases the time to copy data between CPU and GPU • Optimization through multiple streams • Hides the transfer time by overlapping kernel execution with memory transfers.
Pinned Memory • What is pinned or page-locked memory? • Memory that is not paged in or out by the OS. • Pinned memory enables • Faster PCIe copies • Memory copies that are asynchronous with the CPU • Memory copies that are asynchronous with the GPU • Zero-copy • cudaMemcpy(dest, src, size, direction); • The drawback is that it reduces the RAM available to the OS (allocation sketch below).
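A short sketch of allocating and using pinned memory with the CUDA runtime (N and d_data are hypothetical; d_data is assumed to be a device allocation of at least N floats):

float *h_data;
cudaMallocHost((void **)&h_data, N * sizeof(float));      // pinned (page-locked) host allocation
/* ... fill h_data on the CPU ... */
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);   // faster than a copy from pageable memory
cudaFreeHost(h_data);                                      // release the pinned allocation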
Concurrent Execution between Host and Device • To facilitate concurrent execution between host and device, some function calls are asynchronous. • Examples of asynchronous calls • Kernel launches • Device ↔ device memory copies • Host ↔ device memory copies • cudaMemcpyAsync(dest, src, size, direction, stream);
Overlapping Execution: A Concern • When is this overlapping useful? • Note that there is an issue with this idea: • The device execution queue is FIFO • This would prevent overlapping execution with data transfer • This issue is addressed by the use of CUDA “streams”
CUDA Streams: Overview • A stream is a sequence of CUDA commands that execute in order • Look at a stream as a queue of GPU operations • One host thread can define multiple CUDA streams • What are the typical operations in a stream? • Invoking a data transfer • Invoking a kernel execution • Handling events
Streams and Asynchronous Calls • Default API • Kernel launches are asynchronous with the CPU. • Memory copies (H2D or D2H) block the CPU thread. • Streams and asynchronous functions provide • Memory copies asynchronous with the CPU • Operations in different streams can be overlapped • A kernel and memory copies in different streams can be overlapped
Overlap of kernel and memory copy using CUDA streams • Requirements • D2H and H2D memcopies use pinned memory • Kernel and memcopy are issued in different, non-zero streams • Code:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);    // potentially overlapped
kernel<<<grid, block, 0, stream2>>>(…);           // with this kernel launch
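Expanding that snippet into a fuller sketch (array names, kernel, and half are hypothetical; h_in and h_out must be pinned allocations, and half is assumed to be a multiple of 256): the work is split into two halves so the copies for one half can overlap the kernel for the other.

cudaStream_t s[2];
for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; ++i) {
    size_t off = i * half;
    cudaMemcpyAsync(d_in + off, h_in + off, half * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);                     // H2D copy in stream i
    kernel<<<half / 256, 256, 0, s[i]>>>(d_in + off, d_out + off);     // kernel in stream i
    cudaMemcpyAsync(h_out + off, d_out + off, half * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);                     // D2H copy in stream i
}
cudaDeviceSynchronize();                                               // wait for both streams to finish
for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);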
In Summary Thus, Therefore, However …
Project Research Goals • Implement and analyze various CUDA/GPU applications that demonstrate the techniques and issues discussed above, such as: • Global memory coalescing • Bank access in shared memory • Effects of L1 availability • Explain how and why streams can optimize CPU-GPU data transfers. • Run experiments to determine whether the measured results match these expectations.
References • [NV01] NVIDIA CUDA C Programming Guide, Version 4.0, May 2011, Section 5.3.2.3. • [PM01] David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors, NVIDIA Corporation, 2010, pp. 103-108. • [BP01] NVIDIA CUDA C Best Practices Guide, Version 4.0, May 2011, Section 3.2.1, pp. 25-30.