
Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems


Presentation Transcript


  1. Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems. Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. University of Michigan - Ann Arbor

  2. Heterogeneity
  • Computer systems have become heterogeneous: laptops, servers, mobile devices
  • GPUs integrated with CPUs
  • Discrete GPUs added for gaming and supercomputing
  • Co-processors for massive data-parallel workloads: Intel Xeon Phi

  3. Example of Heterogeneous System [Block diagram] A host with a multi-core CPU (4-16 cores) and an integrated GPU (> 16 cores) sharing the L3 and memory controller, with < 20 GB/s to host memory; external discrete GPUs attached over PCIe (< 16 GB/s): a faster GPU (< 300 cores, > 150 GB/s to its global memory) and a slower GPU (< 100 cores, > 100 GB/s to its global memory).

  4. Typical Heterogeneous Execution [Timeline] The CPU runs the sequential code, transfers the input, and launches the kernel on a single device (GPU 1); while the kernel runs, the CPU and GPU 0 sit idle; finally the output is transferred back.

  5. Collaborative Heterogeneous Execution [Timeline] After the sequential code and input transfers, the CPU, GPU 0, and GPU 1 each run a portion of the same kernel concurrently; the partial outputs are transferred back and merged, yielding a speedup over the single-device timeline.
  • 3 issues: how to transform the kernel, how to merge output efficiently, and how to partition the work

  6. OpenCL Execution Model [Diagram] A data-parallel OpenCL kernel is organized as work-groups of work-items. Each compute device consists of compute units (CUs) built from processing elements (PEs); a work-group maps to a CU and each work-item to a PE. The memory hierarchy exposes per-CU shared (local) memory plus constant and global memory.
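As a concrete illustration (not from the slides), the OpenCL C kernel below shows how a work-item locates itself with the built-in ID queries; the kernel name and arguments are placeholders.

    // Minimal OpenCL C kernel: each work-item processes one output element.
    // A work-group of these work-items maps to one compute unit (CU),
    // and each work-item executes on one processing element (PE).
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c,
                             int n)
    {
        int gid = get_global_id(0);   // global work-item index
        int wg  = get_group_id(0);    // which work-group this work-item belongs to
        int lid = get_local_id(0);    // index inside the work-group
        (void)wg; (void)lid;          // queried only for illustration
        if (gid < n)
            c[gid] = a[gid] + b[gid];
    }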

  7. Virtualizing Computing Device [Diagram] On a GPU (e.g., an NVIDIA SMX), hardware schedulers issue the work-items of a work-group onto the lanes of the compute unit; each lane executes one work-item, with the register file and shared memory provided per SMX.

  8. Virtualization on CPU [Diagram] On a multi-core CPU (Core 0-3, each with its own control logic, ALUs, registers, and cache), there is no hardware work-item scheduler; a work-group is mapped onto a core, and the core executes the work-items itself.
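A hedged sketch of that mapping (not the framework's actual runtime code): one CPU thread executes an entire work-group by iterating over its work-items; the struct and function names are invented for illustration and reuse the vector_add body from above.

    // Conceptual sketch: a CPU core executes a whole work-group by looping
    // over its work-items in software instead of scheduling them in hardware.
    typedef struct {
        int group_id;        // flattened work-group ID assigned to this core
        int local_size;      // work-items per work-group
        int n;               // problem size
        const float *a, *b;  // kernel inputs
        float *c;            // kernel output
    } wg_task_t;

    static void run_work_group_on_core(const wg_task_t *t)
    {
        for (int lid = 0; lid < t->local_size; ++lid) {      // serialized work-items
            int gid = t->group_id * t->local_size + lid;     // reconstruct global ID
            if (gid < t->n)
                t->c[gid] = t->a[gid] + t->b[gid];           // body of vector_add
        }
    }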

  9. Collaborative Execution [Diagram] The OpenCL kernel's work-groups are flattened and split across devices; the input is copied from host (global) memory to each device's (global) memory, every device executes its share of the work-groups, and the partial outputs are merged back in host memory.

  10. OpenCL - Single Kernel Multiple Devices (SKMD)
  • Key Ideas
  • Kernel Transform: work on a subset of work-groups; efficiently merge outputs in different address spaces
  • Buffer Management: manages the working set for each device
  • Partition: assigns the optimal workload to each device
  [Diagram] The application binary issues ordinary OpenCL API calls to launch a kernel; the SKMD framework (Kernel Transformer, Buffer Manager, and Partitioner, driven by work-group-variant profile data) transparently distributes the kernel across the CPU device (host), faster GPUs, and slower GPUs attached over PCIe (< 16 GB/s). A host-side sketch of this dispatch follows.
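A simplified, hedged host-side sketch of that dispatch using standard OpenCL API calls; context/program/buffer setup and error checking are omitted, the two kernel objects are assumed to be built for their respective devices, and the argument indices of wg_from/wg_to are kernel-specific placeholders.

    #include <CL/cl.h>

    // Launch the same transformed kernel on two devices, giving each a disjoint
    // range [wg_from, wg_to] of flattened work-group IDs (sketch only).
    void launch_partitioned(cl_kernel cpu_kernel, cl_kernel gpu_kernel,
                            cl_command_queue cpu_queue, cl_command_queue gpu_queue,
                            size_t local_size, int total_wgs, int cpu_wgs)
    {
        size_t global = (size_t)total_wgs * local_size;   // full NDRange on both devices

        int cpu_from = 0,       cpu_to = cpu_wgs - 1;     // CPU's share of work-groups
        int gpu_from = cpu_wgs, gpu_to = total_wgs - 1;   // GPU gets the rest

        clSetKernelArg(cpu_kernel, 2, sizeof(int), &cpu_from);  // indices 2/3 assume wg_from
        clSetKernelArg(cpu_kernel, 3, sizeof(int), &cpu_to);    // and wg_to come last
        clEnqueueNDRangeKernel(cpu_queue, cpu_kernel, 1, NULL,
                               &global, &local_size, 0, NULL, NULL);

        clSetKernelArg(gpu_kernel, 2, sizeof(int), &gpu_from);
        clSetKernelArg(gpu_kernel, 3, sizeof(int), &gpu_to);
        clEnqueueNDRangeKernel(gpu_queue, gpu_kernel, 1, NULL,
                               &global, &local_size, 0, NULL, NULL);
    }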

  11. Kernel Transform
  • Partial Work-group Execution: __kernel void original_program( ... ) becomes partial_program( ... , int wg_from, int wg_to)
  • The N-dimensional work-group ID is flattened (flat_id = get_group_id(0) + get_group_id(1) * get_num_groups(0)), and a guard is inserted before the kernel code: if (flat_id < wg_from || flat_id > wg_to) return;
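Put together, a reconstruction of the transformed kernel might look like the following; the original arguments and kernel body are placeholders.

    // partial_program: original_program with two extra arguments that select a
    // contiguous range of flattened work-group IDs for this device.
    __kernel void partial_program(__global const float *in,   // original arguments
                                  __global float *output,     // (placeholders)
                                  int wg_from, int wg_to)     // added by the transform
    {
        int idx     = get_group_id(0);
        int idy     = get_group_id(1);
        int size_x  = get_num_groups(0);
        int flat_id = idx + idy * size_x;       // flatten the 2-D work-group ID

        if (flat_id < wg_from || flat_id > wg_to)
            return;                             // work-group assigned to another device

        // [KERNEL CODE] -- the original body runs unchanged; placeholder below
        int tid = (int)(get_global_id(1) * get_global_size(0) + get_global_id(0));
        output[tid] = in[tid];
    }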

  12. Buffer Management - Classifying Kernels [Diagram] Kernels are classified by their output access pattern. In a contiguous-memory-access kernel, each device's share of the work-groups (e.g., 33% / 50% / 17% across the CPU host, faster GPUs, and slower GPUs) reads its input range and writes a contiguous range of output addresses, so partial outputs in separate address spaces are easy to merge. In a discontiguous-memory-access kernel, the work-groups write scattered output addresses, making the partial outputs hard to merge.
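Two illustrative kernels (not from the talk) showing the difference; the names are placeholders.

    // Contiguous output: work-group g writes a dense, predictable slice of out[],
    // so each device's partial result can be merged with a simple range copy.
    __kernel void scale_contiguous(__global const float *in,
                                   __global float *out, float s)
    {
        int gid = get_global_id(0);
        out[gid] = s * in[gid];            // write address == global ID
    }

    // Discontiguous output: the write address depends on the data (a histogram),
    // so partial outputs from different address spaces cannot be merged by a
    // range copy alone.
    __kernel void histogram_discontiguous(__global const int *keys,
                                          volatile __global int *bins, int n)
    {
        int gid = get_global_id(0);
        if (gid < n)
            atomic_inc(&bins[keys[gid]]); // scattered, data-dependent writes
    }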

  13. Merging Outputs
  • Intuition: the CPU would produce exactly the same values the GPU produced if it executed the rest of the work-groups, and it already has those results copied back from the GPU
  • Simple solution - a merging kernel: re-enable the work-groups that were enabled on the GPUs, but replace each global store of a computed value with a copy (a load from the GPU result)
  • Alternative - log every output location and check the log when merging: a BAD idea, since it adds overhead to the kernel

  14. Merge Kernel Transformation
  • partial_program( ... , __global float *output, int wg_from, int wg_to) is further transformed into merge_program, which takes one extra argument, __global float *gpu_out (the GPU's result)
  • The store to global memory, output[tid] = sum;, is replaced with output[tid] = gpu_out[tid]; the kernel body that computed sum becomes dead and is removed by dead code elimination
  • Merging cost: done at memory bandwidth (≈ 20 GB/s), i.e. < 0.1 ms in most applications
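A reconstruction of the resulting merge kernel, continuing the placeholder signature used above:

    // merge_program: wg_from/wg_to now select the work-groups that ran on the
    // GPU; their computed values are replaced by copies from the GPU's output,
    // so the original computation is eliminated as dead code.
    __kernel void merge_program(__global const float *in,      // original arguments
                                __global float *output,
                                int wg_from, int wg_to,        // GPU's work-group range
                                __global const float *gpu_out) // GPU result in host space
    {
        int idx     = get_group_id(0);
        int idy     = get_group_id(1);
        int size_x  = get_num_groups(0);
        int flat_id = idx + idy * size_x;

        if (flat_id < wg_from || flat_id > wg_to)
            return;

        int tid = (int)(get_global_id(1) * get_global_size(0) + get_global_id(0));
        // for (...) { sum += ...; }          // removed by dead code elimination
        output[tid] = gpu_out[tid];           // copy instead of recompute
    }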

  15. Partitioning
  • Uses profile data
  • Done at runtime, so it must be fast
  • Uses a greedy decision-tree heuristic
  • Start from the root node, where all work-groups are assigned to the fastest device
  • At each child node, a fixed number of work-groups is offloaded from the fastest device to another device, relative to the parent node

  16. Decision Tree Example
  • At each node, the heuristic also considers: the balancing factor (do not choose a child whose execution times are less balanced across devices), data transfer time, merging costs, and the performance variation with the number of work-groups
  • Example tree, with f(Dev1, Dev2, Dev3) the estimated execution time for a given work-group split: root f(256, 0, 0) = 200 ms; children f(255, 1, 0) = 195 ms and f(255, 0, 1) = 197 ms; then f(254, 2, 0) = 195 ms and f(254, 1, 1) = 193 ms; then f(253, 2, 1) = 192 ms and f(253, 1, 2) = 190 ms. A sketch of the greedy search follows.
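A hedged C sketch of the greedy search under an assumed cost model; estimate_time stands in for the profile-driven cost function f (which would also fold in transfer time, merging cost, and the balancing factor), and NDEV, step, and the function names are placeholders.

    // Greedy descent over work-group assignments (sketch, not the paper's code).
    #define NDEV 3   // device 0 is assumed to be the fastest

    void partition(int total_wgs, int step,
                   double (*estimate_time)(const int wgs[NDEV]),
                   int wgs[NDEV])
    {
        for (int d = 0; d < NDEV; ++d) wgs[d] = 0;
        wgs[0] = total_wgs;                        // root: everything on device 0
        double best = estimate_time(wgs);

        for (;;) {                                 // descend while a child improves
            int best_dev = -1;
            for (int d = 1; d < NDEV && wgs[0] >= step; ++d) {
                wgs[0] -= step; wgs[d] += step;    // trial: offload `step` work-groups
                double t = estimate_time(wgs);
                if (t < best) { best = t; best_dev = d; }
                wgs[0] += step; wgs[d] -= step;    // undo the trial move
            }
            if (best_dev < 0) break;               // no child beats the parent: stop
            wgs[0] -= step; wgs[best_dev] += step; // commit the best child
        }
    }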

  17. Experimental Setup

  18. Results [Speedup chart; baseline: Intel Xeon CPU only]

  19. Results [Speedup chart; 29% improvement over the Intel Xeon-only baseline]

  20. Results (Breakdown) [Execution-time breakdown for Vector Add and Matrix Multiplication]

  21. Summary
  • Systems have become more heterogeneous, configured with several types of devices
  • In existing CPU + GPU heterogeneous systems, a single device executes a single kernel
  • Single Kernel Multiple Devices (SKMD): CPUs and GPUs work together on a single kernel
  • A transparent framework: partial kernel execution, merging of partial outputs, and optimal partitioning
  • Performance improvement of 29% over single-device execution

  22. Q & A
