High Performance Sorting and Searching using Graphics Processors • Naga K. Govindaraju • Microsoft Concurrency
Sorting and Searching “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth
Sorting and Searching • Well studied • High performance computing • Databases • Computer graphics • Programming languages • ... • Google MapReduce algorithm • SPEC benchmark routine!
Massive Databases • Terabyte data sets are common • Google sorts more than 100 billion terms in its index • > 1 trillion records indexed on the web! • Database sizes are rapidly increasing! • Max DB size increases 3x per year (http://www.wintercorp.com) • Processor improvements not matching information explosion
CPU vs. GPU [Figure: system diagram. CPU (3 GHz) with 2 x 1 MB cache; system memory (2 GB); AGP memory (512 MB); PCI-E bus (4 GB/s); GPU (690 MHz) with video memory (512 MB)]
Massive Data Handling on CPUs • Require random memory accesses • Small CPU caches (< 2MB) • Random memory accesses slower than even sequential disk accesses • High memory latency • Huge memory to compute gap! • CPUs are deeply pipelined • Pentium 4 has 30 pipeline stages • Do not hide latency - high cycles per instruction (CPI) • CPU is under-utilized for data intensive applications
Massive Data Handling on CPUs • Sorting is hard! • GPU a potentially scalable solution to terabyte sorting and scientific computing • We beat the sorting benchmark with GPUs and provide a scalable solution
Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Low memory latency pipeline • Programmable • High growth rate • Power-efficient
GPU: Commodity Processor • Laptops • Consoles • Cell phones • PSP • Desktops
Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • 10x more operations per sec than CPUs • High memory bandwidth • Pipeline better hides memory latency • Programmable • High growth rate • Power-efficient
Parallelism on GPUs • Graphics FLOPS • GPU – 1.3 TFLOPS • CPU – 25.6 GFLOPS
Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Pipeline better hides memory latency • Programmable • 10x more memory bandwidth than CPUs • High growth rate • Power-efficient
Graphics Pipeline [Figure: pipeline diagram. Stages: programmable vertex processing (fp32); polygon setup, culling, rasterization; programmable per-pixel math (fp32); per-pixel texture, fp16 blending; Z-buf, fp16 blending, anti-alias (MRT); memory. Data labels: vertex, polygon, pixel, texture, image. Block labels: setup, rasterizer. Callouts: Low pipeline depth; 56 GB/s; Hides memory latency!!]
NON-Graphics Pipeline Abstraction [Figure: pipeline diagram. Stages: programmable MIMD processing (fp32); SIMD “rasterization”; programmable SIMD processing (fp32); data fetch, fp16 blending; predicated write, fp16 blend, multiple output; memory. Data labels: data, lists, data, data, data. Block labels: setup, rasterizer] • Courtesy: David Kirk, Chief Scientist, NVIDIA
Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Pipeline better hides memory latency • Programmable • High growth rate • Power-efficient
Exploiting Technology Moving Faster than Moore’s Law [Figure: GPU growth rate vs. CPU growth rate]
Graphics Processing Units (GPUs) • Commodity processor for graphics applications • Massively parallel vector processors • High memory bandwidth • Pipeline better hides memory latency • Programmable • High growth rate • Power-efficient
GPUs for Sorting and Searching: Issues • No support for arbitrary writes • Optimized CPU algorithms do not map! • Lack of support for general data types • Cache-efficient algorithms • Small data caches • No cache information from vendors • Out-of-core algorithms • Limited GPU memory
Outline • Overview • Sorting and Searching on GPUs • Applications • Conclusions and Future Work
Sorting on GPUs • Adaptive sorting algorithms • Extent of sorted order in a sequence • General sorting algorithms • External memory sorting algorithms
Adaptive Sorting on GPUs • Prior adaptive sorting algorithms require random data writes • GPUs optimized for minimum depth or visible surface computation • Using depth test functionality • Design adaptive sorting using only minimum computations
Adaptive Sorting Algorithm (N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005) • Multiple iterations • Each iteration uses a two pass algorithm • First pass – Compute an increasing sequence M • Second pass – Compute the sorted elements in M • Iterate on the remaining unsorted elements
Increasing Sequence • Given a sequence $S = \{x_1, \ldots, x_n\}$, an element $x_i$ belongs to $M$ if and only if $x_i \le x_j$ for every $x_j \in S$ with $i < j$
Increasing Sequence • M is an increasing sequence [Figure: the sequence X1, X2, X3, …, Xi-1, Xi, Xi+1, …, Xn-2, Xn-1, Xn with a ≤ relation marked]
Increasing Sequence Computation [Figure: the sequence X1, X2, …, Xi-1, Xi, Xi+1, …, Xn-1, Xn]
Increasing Sequence Computation • Start at the last element: Xn ≤ ∞
Increasing Sequence Computation • At element Xi, with Xi+1, …, Xn-1, Xn already processed: xi ≤ Min? • Yes: prepend xi to M and set Min = xi
Increasing Sequence Computation • Finish at the first element: x1 ≤ {x2, …, xn}?
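The first pass can be sketched in a few lines of Python. This is a CPU illustration only, not the depth-test implementation from the talk; the function name `increasing_sequence` and the use of index lists are assumptions made for the example.

```python
import math

def increasing_sequence(seq):
    """Return the indices of the elements that belong to M.

    x_i is in M when it is <= every element after it, so one
    right-to-left scan with a running minimum is enough.
    """
    running_min = math.inf          # the "Xn <= infinity" starting case
    members = []
    for i in range(len(seq) - 1, -1, -1):
        if seq[i] <= running_min:
            members.append(i)       # collected back to front
            running_min = seq[i]
    members.reverse()               # same effect as prepending each x_i
    return members

example = [8, 1, 2, 3, 4, 5, 6, 7, 9]
m_indices = increasing_sequence(example)
m_values = [example[i] for i in m_indices]
```

For the sequence 8 1 2 3 4 5 6 7 9, `m_values` is `[1, 2, 3, 4, 5, 6, 7, 9]`: every element except the leading 8 is no larger than everything to its right.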
Computing Sorted Elements • Theorem 1: Given the increasing sequence $M$, the rank of an element $x_i$ in $M$ is determined if $x_i < \min(I - M)$
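Computing Sorted Elements [Figure: the sequence X1, X2, X3, …, Xi-1, Xi, Xi+1, …, Xn-2, Xn-1, Xn]
Computing Sorted Elements [Figure: the sequence shown as two groups, X1, X3, …, Xi, Xi+1, …, Xn-2, Xn-1 and X2, …, Xi-1, …, Xn, with ≥ and ≤ relations marked]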
Computing Sorted Elements • Linear-time algorithm • Maintaining minimum
Computing Sorted Elements [Figure: the two groups X1, X3, …, Xi, Xi+1, …, Xn-2, Xn-1 and X2, …, Xi-1, …, Xn]
Computing Sorted Elements • At element Xi, with X1, X2, …, Xi-1 already processed: Xi in M? • No: update min • Yes: Xi ≤ min? If yes, append Xi to the sorted list
Computing Sorted Elements [Figure: the full sequence X1, X2, …, Xi-1, Xi, Xi+1, …, Xn-1, Xn]
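The second pass can be sketched the same way. The sketch below assumes the membership flags for M are already known and reads the slides as a left-to-right scan that keeps the minimum of the non-M elements seen so far; the function name `split_sorted` and the `(sorted_part, remaining)` return shape are hypothetical.

```python
import math

def split_sorted(seq, in_m):
    """One left-to-right pass over seq.

    in_m[i] is True when seq[i] belongs to the increasing sequence M.
    Returns (sorted_part, remaining).
    """
    smallest_outside = math.inf     # minimum of the non-M elements seen so far
    sorted_part, remaining = [], []
    for x, member in zip(seq, in_m):
        if not member:
            smallest_outside = min(smallest_outside, x)
            remaining.append(x)
        elif x <= smallest_outside:
            sorted_part.append(x)   # rank is final
        else:
            remaining.append(x)     # in M, but a smaller element is still unsorted
    return sorted_part, remaining

example = [8, 1, 2, 3, 4, 5, 6, 7, 9]
flags = [False, True, True, True, True, True, True, True, True]
done, rest = split_sorted(example, flags)
```

Here `done` is `[1, 2, 3, 4, 5, 6, 7]` and `rest` is `[8, 9]`: the 9 belongs to M but cannot be placed yet because the unsorted 8 is smaller.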
Algorithm Analysis • Knuth’s measure of disorder: given a sequence $I$ and its longest increasing sequence $LIS(I)$, the sequence of disordered elements is $Y = I - LIS(I)$ • Theorem 2: Given a sequence $I$ and $LIS(I)$, our adaptive algorithm sorts in at most $2\|Y\| + 1$ iterations
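To make the bound concrete, the disorder measure can be computed with a standard patience-style LIS routine. This is a sketch under one assumption: "increasing" is read as non-decreasing, to match the ≤ comparisons used elsewhere in the talk. The function name `disorder` is hypothetical.

```python
from bisect import bisect_right

def disorder(seq):
    """Return ||Y|| = len(seq) - length of a longest non-decreasing subsequence."""
    tails = []  # tails[k] = smallest possible last value of a subsequence of length k + 1
    for x in seq:
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)         # x extends the longest subsequence found so far
        else:
            tails[pos] = x          # x gives a smaller tail for length pos + 1
    return len(seq) - len(tails)

example = [8, 1, 2, 3, 4, 5, 6, 7, 9]
y = disorder(example)
bound = 2 * y + 1
```

For 8 1 2 3 4 5 6 7 9 only the 8 is out of place, so `y` is 1 and Theorem 2 allows at most `bound = 3` iterations.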
Pictorial Proof [Figure: the sequence X1, X2, …, Xl, Xl+1, Xl+2, …, Xm, Xm+1, Xm+2, …, Xq, Xq+1, Xq+2, …, Xn]
Pictorial Proof [Figure: the same sequence with three “2 iterations” labels]
Example • Input: 8 1 2 3 4 5 6 7 9
Example • After one iteration: 1 2 3 4 5 6 7 are sorted; 8 9 remain
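The whole iteration loop can be sketched end to end and run on this example. It is a sequential CPU illustration, assuming a right-to-left first pass and a left-to-right second pass; it is not the GPU depth-test version, and the name `adaptive_sort` and the iteration counter are additions for the example.

```python
import math

def adaptive_sort(seq):
    """Return (sorted_list, iterations) using two passes per iteration."""
    remaining = list(seq)
    output = []
    iterations = 0
    while remaining:
        iterations += 1
        # Pass 1 (right to left): flag elements that are <= everything after them.
        in_m = [False] * len(remaining)
        running_min = math.inf
        for i in range(len(remaining) - 1, -1, -1):
            if remaining[i] <= running_min:
                in_m[i] = True
                running_min = remaining[i]
        # Pass 2 (left to right): emit flagged elements that do not exceed any
        # unflagged element seen so far; keep the rest for the next iteration.
        smallest_outside = math.inf
        leftover = []
        for x, member in zip(remaining, in_m):
            if member and x <= smallest_outside:
                output.append(x)
            else:
                leftover.append(x)
                if not member:
                    smallest_outside = min(smallest_outside, x)
        remaining = leftover
    return output, iterations

result, rounds = adaptive_sort([8, 1, 2, 3, 4, 5, 6, 7, 9])
```

On 8 1 2 3 4 5 6 7 9 the first iteration emits 1 through 7 and the second emits 8 and 9, so `rounds` is 2. A reverse-sorted input emits only one element per iteration, which is the quadratic worst case for this kind of scheme.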
Advantages • Linear in the input size and sorted extent • Works well on almost sorted input • Maps well to GPUs • Uses depth test functionality for minimum operations • Useful for performing 3D visibility ordering • Perform transparency computations on dynamic 3D environments • Cons: • Expected time: $O(n^2 - 2n\sqrt{n})$ on random sequences
Video: Transparent PowerPlant • 790K polygons • Depth complexity ~ 13 • 1600x1200 resolution • NVIDIA GeForce 6800 • 5-8 fps
General Sorting on GPUs (N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005) • General datasets • High performance
General Sorting on GPUs • Design sorting algorithms with deterministic memory accesses – “Texturing” on GPUs • 56 GB/s peak memory bandwidth • Can better hide the memory latency!! • Require minimum and maximum computations – “Blending functionality” on GPUs • Low branching overhead • No data dependencies • Utilize high parallelism on GPUs
GPU-Based Sorting Networks • Represent data as 2D arrays • Multi-stage algorithm • Each stage involves multiple steps • In each step • Compare one array element against exactly one other element at fixed distance • Perform a conditional assignment (MIN or MAX) at each element location
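The stage/step structure above can be illustrated with a bitonic sorting network. The slide does not name the network, so bitonic sort is a stand-in chosen for the example, not necessarily the one used in the talk. Each step reads the previous array and writes every location exactly once with a MIN or a MAX, which mirrors the "no arbitrary writes" constraint. The function name `bitonic_sort` is hypothetical.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    data = list(values)
    stage = 2
    while stage <= n:                 # each stage merges bitonic runs of this size
        distance = stage // 2
        while distance >= 1:          # each step compares at one fixed distance
            previous = data           # read-only input of the step
            data = [0] * n            # every location is written exactly once
            for i in range(n):
                partner = i ^ distance
                ascending = (i & stage) == 0
                keeps_min = (i < partner) == ascending
                a, b = previous[i], previous[partner]
                data[i] = min(a, b) if keeps_min else max(a, b)
            distance //= 2
        stage *= 2
    return data

sorted_values = bitonic_sort([7, 3, 6, 8, 1, 5, 4, 2])
```

The inner `for` loop has no dependencies between iterations, so on a GPU each location could be handled by its own fragment; the comparison pattern depends only on the index, never on the data.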
2D Memory Addressing • GPUs optimized for 2D representations • Map 1D arrays to 2D arrays • Minimum and maximum regions mapped to row-aligned or column-aligned quads
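A small sketch of the 1D-to-2D mapping, assuming a row-major layout, a power-of-two width, and the XOR partner rule of a bitonic-style network (all three are assumptions; the slide does not fix them). Under those assumptions a compare distance smaller than the width pairs elements within a row, while a larger distance pairs whole rows, which is why the MIN and MAX regions become rectangular quads.

```python
def to_2d(index, width):
    """Map a 1D array index to a (row, column) position, row-major."""
    return divmod(index, width)

def to_1d(row, column, width):
    """Inverse of to_2d."""
    return row * width + column

def partner_2d(row, column, distance, width):
    """2D position of the element compared against (row, column).

    Assumes width and distance are powers of two and the 1D partner of
    index i is i XOR distance.
    """
    if distance < width:
        return row, column ^ distance       # partner is in the same row
    return row ^ (distance // width), column  # partner is in the same column

width = 4
position = to_2d(9, width)
same_row_partner = partner_2d(2, 1, 2, width)
same_column_partner = partner_2d(2, 1, 8, width)
```

With `width = 4`, index 9 lands at row 2, column 1; its partner at distance 2 is index 11 in the same row, and its partner at distance 8 is index 1 in the same column.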