Lab Assignment #2 Data Parallel Reduction


  1. Lab Assignment #2: Data Parallel Reduction. Farhad Parsan

  2. Data Parallel Reduction • Sum reduction kernel (with thread divergence)
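The kernel for this slide is not reproduced in the transcript. As a rough illustration only, the interleaved addressing pattern usually meant by a reduction "with thread divergence" can be simulated on the host in plain C++. The helper name `reduceInterleaved` and the predicate `t % (2 * stride) == 0` are assumptions taken from the commonly taught form of this kernel, not from the slide itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation (hypothetical helper, not the slide's kernel).
// Each index t plays the role of threadIdx.x; each pass of the outer loop
// plays the role of one iteration between __syncthreads() barriers.
float reduceInterleaved(std::vector<float> partialSum) {
    const std::size_t n = partialSum.size();
    assert(n > 0 && (n & (n - 1)) == 0 && "size must be a power of two");
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t t = 0; t < n; ++t) {
            // Only every (2 * stride)-th thread does work, so active and
            // idle threads alternate inside a warp: this is the divergence.
            if (t % (2 * stride) == 0) {
                partialSum[t] += partialSum[t + stride];
            }
        }
    }
    return partialSum[0];
}
```

Running the threads one after another is safe here because, within one iteration, no thread reads an element that another thread writes in that same iteration.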

  3. Data Parallel Reduction • Sum reduction kernel (without thread divergence)
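To make the difference between the two kernels concrete, the sketch below counts how many warps contain both active and idle threads in each iteration, summed over the whole reduction. It is a host-side model under stated assumptions: a warp size of 32, one element per thread, `t % (2 * stride) == 0` as the interleaved predicate and `t < stride` as the contiguous one. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // assumed warp size

// True if the warp starting at thread `base` has both active and idle threads.
template <typename Pred>
bool warpIsMixed(std::size_t base, Pred active) {
    std::size_t count = 0;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        if (active(base + lane)) ++count;
    }
    return count != 0 && count != kWarpSize;
}

// Interleaved pattern: stride doubles, thread t works if t % (2 * stride) == 0.
std::size_t mixedWarpsInterleaved(std::size_t n) {
    assert(n >= kWarpSize && (n & (n - 1)) == 0 && "n must be a power of two >= 32");
    std::size_t mixed = 0;
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t base = 0; base < n; base += kWarpSize) {
            if (warpIsMixed(base, [stride](std::size_t t) { return t % (2 * stride) == 0; })) {
                ++mixed;
            }
        }
    }
    return mixed;
}

// Contiguous pattern: stride halves, thread t works if t < stride.
std::size_t mixedWarpsContiguous(std::size_t n) {
    assert(n >= kWarpSize && (n & (n - 1)) == 0 && "n must be a power of two >= 32");
    std::size_t mixed = 0;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t base = 0; base < n; base += kWarpSize) {
            if (warpIsMixed(base, [stride](std::size_t t) { return t < stride; })) {
                ++mixed;
            }
        }
    }
    return mixed;
}
```

For 512 threads this model counts 95 mixed warp-iterations for the interleaved pattern and 5 for the contiguous one. The remaining 5 come from the last five iterations, where fewer than 32 threads are still active, so "without thread divergence" is best read as "with far less divergence".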

  4. Host Code

     float computeOnDevice(float* h_data, int num_elements) {
         int size = num_elements * sizeof(float);
         float result;
         float* hd_data;

         // 1. Allocate device memory and load the input
         cudaMalloc((void**) &hd_data, size);
         cudaMemcpy(hd_data, h_data, size, cudaMemcpyHostToDevice);

         // 2. Kernel invocation: one block, one thread per element
         dim3 dimBlock(num_elements, 1);
         dim3 dimGrid(1, 1);
         reduction<<<dimGrid, dimBlock>>>(hd_data);

         // 3. Copy the buffer back; the sum is in element 0
         cudaMemcpy(h_data, hd_data, size, cudaMemcpyDeviceToHost);
         result = h_data[0];

         // Free device memory
         cudaFree(hd_data);
         return result;
     }
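A common way to check the value returned by `computeOnDevice` is to compare it with a sequential sum computed on the CPU. The sketch below is one possible form of that check; `referenceSum`, `closeEnough`, and the default tolerance are hypothetical and not part of the lab code.

```cpp
#include <cassert>
#include <cmath>

// Sequential reference sum (hypothetical helper).
float referenceSum(const float* h_data, int num_elements) {
    assert(h_data != nullptr && num_elements >= 0);
    float sum = 0.0f;
    for (int i = 0; i < num_elements; ++i) {
        sum += h_data[i];
    }
    return sum;
}

// The GPU adds the elements in a different order than the CPU loop, so the
// two float results can differ slightly; compare with a relative tolerance.
bool closeEnough(float device_result, float reference, float rel_tol = 1e-5f) {
    const float scale = std::fmax(1.0f, std::fabs(reference));
    return std::fabs(device_result - reference) <= rel_tol * scale;
}
```

Because the host code copies the device buffer back over `h_data`, the reference sum has to be computed before `computeOnDevice` is called.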

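  5. Device Code

     #define NUM_ELEMENTS 512

     __global__ void reduction(float *hd_data) {
         __shared__ float partialSum[NUM_ELEMENTS];

         // Each thread loads one element into shared memory
         unsigned int t = threadIdx.x;
         partialSum[t] = hd_data[t];

         // Halve the stride each iteration: blockDim.x/2, ..., 2, 1
         for (unsigned int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
             __syncthreads();
             if (t < stride)
                 partialSum[t] += partialSum[t + stride];
         }

         // partialSum[0] now holds the total; the host reads hd_data[0]
         hd_data[t] = partialSum[t];
     }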
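To see what the loop does step by step, the sketch below replays it on the host and records the contents of `partialSum` after every iteration. It assumes one element per thread and a stride that starts at half the block size and halves until it reaches 1. `traceReduction` is a hypothetical helper, not part of the lab code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side trace of the reduction loop (hypothetical helper).
// Returns a snapshot of partialSum after each iteration.
std::vector<std::vector<float>> traceReduction(std::vector<float> partialSum) {
    const std::size_t n = partialSum.size();
    assert(n > 0 && (n & (n - 1)) == 0 && "size must be a power of two");
    std::vector<std::vector<float>> snapshots;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // Threads with t < stride each add the element `stride` positions away.
        for (std::size_t t = 0; t < stride; ++t) {
            partialSum[t] += partialSum[t + stride];
        }
        snapshots.push_back(partialSum);
    }
    return snapshots;
}
```

For the input 1, 2, ..., 8 the first element goes from 6 (stride 4) to 16 (stride 2) to 36 (stride 1), while the upper half of the array is never modified.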

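  6. Question • How many times does your thread block synchronize to reduce the array of 512 elements to a single value?
     The block calls __syncthreads() once per reduction iteration. With one element per thread, the stride takes the values N/2, N/4, …, 2, 1, so
     Number of synchronizations = Number of reduction iterations = log2 N
     Assuming N = 512 => Number of synchronizations = 9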

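  7. Question • What is the minimum, maximum, and average number of "real" operations that a thread will perform?
     "Real" operations are those that directly contribute to the final reduction value.
     Maximum: thread 0 adds in every iteration, so it performs log2 N operations; if N = 512 => Maximum = 9
     Minimum: 0, for any thread that never passes the if test (threads t ≥ N/2 in the kernel without divergence, odd-numbered threads in the kernel with divergence)
     Average: [ 1 + 2 + 4 + … + (N/2) ] / N = (N − 1) / N; if N = 512 => Average = 511/512 ≈ 0.998 ≈ 1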
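These counts can be checked by replaying the loop on the host and tallying the additions each thread performs. The sketch assumes one element per thread and a stride that halves from N/2 down to 1, so its numbers apply to that loop only. `OpStats` and `countAdditions` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct OpStats {
    std::size_t iterations;  // loop iterations, one __syncthreads() each
    std::size_t minOps;      // fewest additions done by any thread
    std::size_t maxOps;      // most additions done by any thread
    std::size_t totalOps;    // additions summed over all threads
};

// Tallies additions per thread for the "t < stride" reduction loop.
OpStats countAdditions(std::size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0 && "n must be a power of two");
    std::vector<std::size_t> ops(n, 0);
    std::size_t iterations = 0;
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        ++iterations;
        for (std::size_t t = 0; t < stride; ++t) {
            ++ops[t];  // thread t performs one addition this iteration
        }
    }
    return {iterations,
            *std::min_element(ops.begin(), ops.end()),
            *std::max_element(ops.begin(), ops.end()),
            std::accumulate(ops.begin(), ops.end(), std::size_t{0})};
}
```

For n = 512 this sketch reports 9 iterations, a maximum of 9 additions, a minimum of 0, and 511 additions in total, which gives an average of 511/512 ≈ 0.998 per thread.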