
List Ranking and Parallel Prefix

  1. List Ranking and Parallel Prefix Sathish Vadhiyar

  2. List Ranking on GPUs • Linked list prefix computations – computations of prefix sum on the elements contained in a linked list • Linked list represented as an array • Irregular memory accesses – successor of each node of a linked list can be contained anywhere • List ranking – special case of list prefix computations in which all the values are identity, i.e., 1.

  3. List ranking • L is a singly linked list • Each node contains two fields – a data field, and a pointer to the successor • Prefix sums – updating data field with summation of values of its predecessors and itself • L represented by an array X with fields X[i].prefix and X[i].succ

  4. Sequential Algorithm • Simple and effective • Two passes • Pass 1: Identify the head node • Pass 2: Traverse from the head, following the successor nodes and accumulating the prefix sums in traversal order • Works well in practice

  5. Parallel Algorithm: Prefix computations on arrays • Array X partitioned into subarrays • Local prefix sums of each subarray calculated in parallel • Prefix sums of last elements of each subarray written to a separate array Y • Prefix sums of elements in Y are calculated. • Each prefix sum of Y is added to corresponding block of X • Divide and conquer strategy

  6. Example • Input: 1 2 3 4 5 6 7 8 9 • Divide: 1 2 3 | 4 5 6 | 7 8 9 • Local prefix sums: 1,3,6 | 4,9,15 | 7,15,24 • Last elements passed to one processor: 6, 15, 24 • Prefix sums of the last elements computed on that processor: 6, 21, 45 • Adding the global prefix sum of the preceding blocks to the local prefix sums in each processor: 1,3,6 | 10,15,21 | 28,36,45 • Result: 1,3,6,10,15,21,28,36,45

  7. Prefix computation on list • The previous strategy cannot be applied here • Division of array X that represents list will lead to subarrays each of which can have many sublist fragments • Head nodes will have to be calculated for each of them

  8. Parallel List Ranking (Wyllie’s algorithm) • Involves repeated pointer jumping • The successor pointer of each element is repeatedly updated so that it jumps over its successor until it reaches the end of the list • As each processor traverses and updates the successor, the ranks are updated • A process or thread is assigned to each element of the list

  9. Parallel List Ranking (Wyllie’s algorithm) • Will lead to high synchronizations among threads • In CUDA - many kernel invocations

  10. Parallel List Ranking (Helman and JaJa) • Randomly select s nodes or splitters. The head node is also a splitter • Form s sublists. In each sublist, start from a splitter as the head node, and traverse till another splitter is reached. • Form prefix sums in each sublist • Form another list, L’, consisting of only these splitters in the order they are traversed. The values in each entry of this list will be the prefix sum calculated in the respective sublists • Calculate prefix sums for this list • Add these sums to the values of the sublists

  11. Parallel List Ranking on GPUs: Steps • Step 1: Compute the location of the head of the list • Each of the indices between 0 and n-1, except the head node, occurs exactly once in the successor array • Hence head node = n(n-1)/2 – SUM_SUCC • SUM_SUCC = sum of the successor values • Can be done on GPUs using parallel reduction

  12. Parallel List Ranking on GPUs: Steps • Step 2: Select s random nodes to split the list into s random sublists • For every subarray of X of size n/s, select a random location as a splitter • Highly data parallel; the selections can be made independently of each other

  13. Parallel List Ranking on GPUs: Steps • Step 3: Using the standard sequential algorithm, compute the prefix sums of each sublist separately • The most computationally demanding step • The s sublists are allocated equally among CUDA blocks, and then equally among the threads in a block • Each thread computes the prefix sums of each of its sublists, and copies the prefix value of the last element of sublist i to Sublist[i]

  14. Parallel List Ranking on GPUs: Steps • Step 4: Compute prefix sum of splitters, where the successor of a splitter is the next splitter encountered when traversing the list • This list is small • Hence can be done on CPU

  15. Parallel List Ranking on GPUs: Steps • Step 5: Update values of prefix sums computed in step 3 using splitter prefix sums of step 4 • This can be done using coalesced memory access – access by threads to contiguous locations

  16. Choosing s • Large values of s increase the chance of threads dealing with equal numbers of nodes • However, values that are too large incur the overhead of sublist creation and aggregation

  17. Parallel Prefix on GPUs • Using binary tree • An upward reduction phase (reduce phase or up-sweep phase) • Traversing tree from leaves to root forming partial sums at internal nodes • Down-sweep phase • Traversing from root to leaves using partial sums computed in reduction phase

  18. Up Sweep

  19. Down Sweep

  20. Host Code

int main()
{
  const unsigned int num_threads = num_elements / 2;
  /* cudaMalloc d_idata and d_odata */
  cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);
  dim3 grid(256, 1, 1);
  dim3 threads(num_threads, 1, 1);
  // launch the scan kernel; the third launch parameter sizes the
  // dynamically allocated shared memory used by the kernel
  scan_workefficient<<< grid, threads, sizeof(float) * num_elements >>>(d_odata, d_idata, num_elements);
  cudaMemcpy(h_data, d_odata, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);
  /* cudaFree d_idata and d_odata */
}

  21. Device Code

__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
  // Dynamically allocated shared memory for scan kernels
  extern __shared__ float temp[];
  int thid = threadIdx.x;
  int offset = 1;

  // Cache the computational window in shared memory
  temp[2*thid]   = g_idata[2*thid];
  temp[2*thid+1] = g_idata[2*thid+1];

  // build the sum in place up the tree
  for (int d = n>>1; d > 0; d >>= 1)
  {
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }

  22. Device Code

  // scan back down the tree
  // clear the last element
  if (thid == 0)
    temp[n - 1] = 0;

  // traverse down the tree building the scan in place
  for (int d = 1; d < n; d *= 2)
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      float t = temp[ai];
      temp[ai] = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();

  // write results to global memory
  g_odata[2*thid] = temp[2*thid];
  g_odata[2*thid+1] = temp[2*thid+1];
}

  23. References • Fast and Scalable List Ranking on the GPU. ICS 2009. • Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA. IPDPS 2010.
