
Real-time Ray Tracing on GPU with BVH-based Packet Traversal

Presentation Transcript


  1. Real-time Ray Tracing on GPU with BVH-based Packet Traversal
     Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek

  2. Background
     • GPUs attractive for ray tracing
       • High computational power
       • Shading-oriented architecture
     • GPU ray tracers
       • Carr – the Ray Engine
       • Purcell – full ray tracing on the GPU, based on grids
       • Ernst – kd-trees with a parallel stack
       • Carr, Thrane & Simonsen – BVHs
       • Foley, Horn, Popov – kd-trees with stackless traversal

  3. Motivation
     • So far: interactive ray tracing on the GPU, but
       • Limited model size
       • No dynamic scene support
     • The G80 – a new approach to the GPU
       • High-performance general-purpose processor with graphics extensions
       • PRAM architecture
     • BVHs allow for
       • Dynamic/deformable scenes
       • A small memory footprint
     • Goal: recursive ordered traversal of BVHs on the G80

  4. GPU Architecture (G80)
     • Multi-threaded scalar architecture
       • 12K hardware threads
       • Threads cover latencies
         • Off-chip memory ops
         • Instruction dependencies
       • 4 or 16 cycles to issue an instruction
     • 16 (multi-)cores
       • 8-wide SIMD
       • 128 scalar cores in total
     • Cores process threads in 32-wide SIMD chunks
     [Diagram: a chunk pool feeds Multi-Core 1 … Multi-Core 16; each chunk groups threads 1–32 under a single instruction pointer (IP)]

  5. GPU Architecture (G80)
     • Scalar register file (8K registers)
       • Partitioned among the running threads
     • Shared memory (16 KB)
       • On-chip, 0-cycle latency
     • On-board memory (768 MB)
       • Large latency (~200 cycles)
       • R/W from within a thread
       • Uncached
     • Read-only L2 cache (128 KB)
       • On-chip, shared among all threads
     [Diagram: per-thread registers and shared memory inside each multi-core; the L2 cache (128 KB) and on-board memory are shared by all multi-cores]

  6. Programming the G80
     • CUDA
       • C-based language with parallel extensions
     • GPU utilization reaches 100% only if
       • Enough threads are present (>> 12K)
       • Every thread uses fewer than 10 registers and 5 words (32 bit) of shared memory
       • There is enough computation per transferred word of data
         • Bandwidth << computational power
       • The memory access pattern allows read combining

  7. Performance Bottlenecks
     • Efficient per-thread stack implementation
       • Shared memory is too small – it would limit parallelism
       • On-board memory is uncached
         • Need enough computation between stack operations
       • (A sketch of such a stack follows below)
     • Efficient memory access pattern
       • Use the texture caches
         • However, only a few words of cache per thread
       • Read successive memory locations in successive threads of a chunk
         • Single round trip to memory (read combining)
       • Cover latency with enough computation
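     The two points can be combined: if the per-thread stacks of the 32 threads of a chunk are interleaved in on-board memory, a push or pop at the same stack depth touches 32 consecutive words and is served by a single combined memory transaction. Below is a minimal sketch of such a layout; CHUNK_SIZE, STACK_DEPTH, MAX_CHUNKS and the helper names are illustrative assumptions, not the paper's implementation.

     #define CHUNK_SIZE  32          // threads per SIMD chunk
     #define STACK_DEPTH 64          // maximum supported BVH depth (assumed)
     #define MAX_CHUNKS  1024        // enough chunks to fill the machine (assumed)

     // One flat array in on-board (global) memory holds all stacks. Entry d of
     // lane l in chunk c lives at stackBase(c, l)[d * CHUNK_SIZE], so the 32
     // threads of a chunk access 32 consecutive words for the same depth.
     __device__ unsigned gStacks[MAX_CHUNKS * STACK_DEPTH * CHUNK_SIZE];

     __device__ unsigned* stackBase(unsigned chunkId, unsigned lane)
     {
         return gStacks + chunkId * STACK_DEPTH * CHUNK_SIZE + lane;
     }

     __device__ void push(unsigned* base, int& top, unsigned nodeIdx)
     {
         base[++top * CHUNK_SIZE] = nodeIdx;     // pre-increment, then scale by the stride
     }

     __device__ unsigned pop(unsigned* base, int& top)
     {
         return base[top-- * CHUNK_SIZE];        // read the old top, then decrement
     }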

  8. Ray Tracing on the G80
     • Map each ray to one thread (a minimal kernel skeleton is sketched below)
       • Enough threads to keep the GPU busy
     • Recursive ray tracing
       • Use a per-thread stack stored in on-board memory
       • Efficient, since enough computation is present
     • But how to do the traversal?
       • Skip pointers (Thrane) – no ordered traversal
       • Geometry images (Carr) – single mesh only
       • Shared-stack traversal
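     A minimal sketch of the ray-to-thread mapping, under the assumption of a simple Ray struct and a traceRay() routine standing in for the actual traversal and intersection code (both hypothetical, not from the paper):

     struct Ray { float org[3], dir[3]; };

     // Full BVH traversal + intersection; assumed to be defined elsewhere.
     __device__ float traceRay(const Ray& ray);

     __global__ void renderKernel(const Ray* rays, float* hitDist, int numRays)
     {
         // One thread per ray: the ray index follows directly from the block/thread IDs.
         int rayId = blockIdx.x * blockDim.x + threadIdx.x;
         if (rayId < numRays)
             hitDist[rayId] = traceRay(rays[rayId]);
     }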

  9. SIMD Packet Traversal of BVH
     • Traverse a node with the whole packet
     • At an internal node:
       • Intersect all rays with both children and determine the traversal order
       • Push the far child (if any) onto a stack and descend to the near one with the packet
     • At a leaf:
       • Intersect all rays with the contained geometry
       • Pop the next node to visit from the stack
     • (The single-ray form of this loop is sketched below; the packet version follows on the next slides)
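     For reference before the packet version, a hedged single-ray sketch of the ordered traversal loop in CUDA-style C++. The node layout, the Ray and Hit structs, the slab test and the intersectTriangles() helper are illustrative assumptions rather than the paper's data structures.

     struct BVHNode {
         float bbMin[2][3], bbMax[2][3];   // bounds of the two children
         int   child[2];                   // child node indices (internal nodes)
         int   triBegin, triCount;         // triCount > 0 marks a leaf
     };
     struct Ray { float org[3], invDir[3]; };
     struct Hit { float t; int triId; };   // hit.t must be initialized to the ray's max length

     // Slab test against child c of node n; returns the entry distance in tEnter.
     __device__ bool hitChild(const BVHNode& n, int c, const Ray& r, float tMax, float& tEnter)
     {
         float t0 = 0.0f, t1 = tMax;
         for (int a = 0; a < 3; ++a) {
             float tn = (n.bbMin[c][a] - r.org[a]) * r.invDir[a];
             float tf = (n.bbMax[c][a] - r.org[a]) * r.invDir[a];
             if (tn > tf) { float tmp = tn; tn = tf; tf = tmp; }
             t0 = fmaxf(t0, tn);
             t1 = fminf(t1, tf);
         }
         tEnter = t0;
         return t0 <= t1;
     }

     // Ray/triangle tests for a leaf; assumed to be defined elsewhere.
     __device__ void intersectTriangles(const BVHNode& leaf, const Ray& r, Hit& h);

     __device__ void traverseSingleRay(const BVHNode* nodes, const Ray& ray, Hit& hit)
     {
         int stack[64];                 // per-ray stack of node indices
         int top = -1;
         int cur = 0;                   // start at the root

         for (;;) {
             const BVHNode& n = nodes[cur];
             if (n.triCount > 0) {                          // leaf
                 intersectTriangles(n, ray, hit);
                 if (top < 0) return;                       // traversal done
                 cur = stack[top--];                        // pop the next node
                 continue;
             }
             float t0, t1;                                  // internal node
             bool h0 = hitChild(n, 0, ray, hit.t, t0);
             bool h1 = hitChild(n, 1, ray, hit.t, t1);
             if (h0 && h1) {
                 int nearC = (t0 <= t1) ? 0 : 1;            // visit the nearer child first
                 stack[++top] = n.child[1 - nearC];         // push the far child
                 cur = n.child[nearC];
             } else if (h0)  cur = n.child[0];
             else if (h1)    cur = n.child[1];
             else {                                         // missed both children
                 if (top < 0) return;
                 cur = stack[top--];
             }
         }
     }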

  10. PRAM Basics
     • The PRAM model
       • Implicitly synchronized processors (threads)
       • Memory shared between all processors
     • Basic PRAM operations (sketched in code below)
       • Parallel OR in O(1)
       • Parallel reduction in O(log N)
     [Diagram: parallel OR over boolean inputs; parallel sum reduction of 12, 32, 11, 9 → 44, 20 → 64]
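     A hedged CUDA sketch of these two primitives for one 32-thread chunk (a warp), using shared memory. It relies on the lockstep, warp-synchronous execution of a chunk described on slide 4; the names and the volatile-based style are illustrative, and larger thread blocks would need explicit barriers.

     #define CHUNK_SIZE 32                       // threads per chunk (warp)

     // Parallel OR in O(1): thread 0 clears one shared flag, then every thread
     // with a 'true' vote sets it; afterwards all threads read the same result.
     __device__ bool parallelOr(volatile int* shFlag, bool vote)
     {
         if (threadIdx.x == 0) *shFlag = 0;
         if (vote)             *shFlag = 1;
         return *shFlag != 0;
     }

     // Parallel reduction (sum) in O(log N): pairwise sums over shared memory,
     // e.g. reducing 12, 32, 11, 9 to 64 as in the slide's figure.
     __device__ int parallelSum(volatile int* shVals, int myVal)
     {
         shVals[threadIdx.x] = myVal;
         for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2)
             if (threadIdx.x < stride)
                 shVals[threadIdx.x] = shVals[threadIdx.x] + shVals[threadIdx.x + stride];
         return shVals[0];                       // the total, visible to every thread
     }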

  11. PRAM Packet Traversal of BVH
     • The G80 is a PRAM machine at the chunk level
       • Map packet → chunk, ray → thread
     • Threads behave as in the single-ray traversal
       • At a leaf: intersect with the geometry, pop the next node from the stack
       • At a node: decide which children to visit and in what order, push the far child
     • Difference: how rays choose which node to visit first
       • It might not be the one they individually want

  12. PRAM Packet Traversal of BVH
     • Choose the child traversal order (see the sketch below)
       • PRAM OR to determine whether all rays agree on visiting the same node first
       • The result is stored in shared memory
     • In case of divergence: choose the child with more ray candidates
       • Use a PRAM SUM over +1/-1 votes, one per thread (-1 → left node)
       • Look at the sign of the result
     • Guarantees synchronous traversal of the BVH
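     A hedged sketch of this decision for one 32-thread chunk, in the same warp-synchronous style as the primitives above. hitsLeft, hitsRight and leftIsNearer come from each ray's own child box tests; rays that missed both children cast a neutral vote. The function name, the shared variables and the exact voting rule are illustrative assumptions, not the paper's code.

     #define CHUNK_SIZE 32

     // Returns 0 to descend into the left child first, 1 for the right child.
     __device__ int chooseFirstChild(volatile int* shWantLeft,   // one shared word
                                     volatile int* shWantRight,  // one shared word
                                     volatile int* shVotes,      // CHUNK_SIZE shared words
                                     bool hitsLeft, bool hitsRight, bool leftIsNearer)
     {
         // Each ray's own preference (don't-care if it missed both children).
         bool wantLeft  = hitsLeft  && (leftIsNearer || !hitsRight);
         bool wantRight = hitsRight && (!leftIsNearer || !hitsLeft);

         // PRAM OR: does any ray prefer the left (right) child first?
         if (threadIdx.x == 0) { *shWantLeft = 0; *shWantRight = 0; }
         if (wantLeft)  *shWantLeft  = 1;
         if (wantRight) *shWantRight = 1;
         if (!*shWantRight) return 0;       // nobody insists on right: all go left first
         if (!*shWantLeft)  return 1;       // nobody insists on left:  all go right first

         // Divergence: majority vote with a PRAM SUM of +1 / -1 (-1 means left).
         shVotes[threadIdx.x] = wantLeft ? -1 : (wantRight ? +1 : 0);
         for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2)
             if (threadIdx.x < stride)
                 shVotes[threadIdx.x] = shVotes[threadIdx.x] + shVotes[threadIdx.x + stride];
         return (shVotes[0] < 0) ? 0 : 1;   // the sign of the sum picks the first child
     }

     Because every thread reads the same shared result, all rays of the packet descend into the same child, which is what guarantees the synchronous traversal mentioned above.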

  13. PRAM Packet Traversal of BVH
     • Stack:
       • The near and far child are the same for all threads => store them once
       • Keep the stack in shared memory – only a few bits per thread!
       • Only thread 0 performs the stack operations
     • Reading data:
       • All threads work on the same node / triangle
       • Sequential threads bring in sequential words
       • Single load operation, single round trip to memory
     • Implementable in CUDA (helper sketches below)
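     Two hedged helper sketches for these points: a chunk-wide stack kept in shared memory and touched only by thread 0, and a node fetch in which sequential threads read sequential words of the same node so that the reads combine into a single round trip. The node size (NODE_WORDS 32-bit words) and all names are illustrative assumptions.

     #define NODE_WORDS 16        // words per BVH node in the assumed layout

     // Shared stack: the near/far decision is identical for all rays of the packet,
     // so one entry per level suffices and only thread 0 updates it.
     __device__ void pushFarChild(volatile int* shStack, volatile int* shTop, int farChild)
     {
         if (threadIdx.x == 0) {
             *shTop = *shTop + 1;
             shStack[*shTop] = farChild;
         }
     }

     __device__ int popNextNode(volatile int* shStack, volatile int* shTop)
     {
         int node = shStack[*shTop];          // every thread reads the same top entry
         if (threadIdx.x == 0)
             *shTop = *shTop - 1;             // only thread 0 moves the stack pointer
         return node;
     }

     // Coalesced node fetch: the first NODE_WORDS threads of the chunk read
     // successive words of the same node into shared memory in one round trip.
     __device__ void loadNode(const float* nodes, int nodeIdx, volatile float* shNode)
     {
         if (threadIdx.x < NODE_WORDS)
             shNode[threadIdx.x] = nodes[nodeIdx * NODE_WORDS + threadIdx.x];
     }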

  14. Results

  15. Analysis
     • Coherent branch decisions / memory accesses
     • Small footprint of the data structure
       • Can trace models of up to 12 million triangles
     • The program becomes compute bound
       • Determined by over-/under-clocking the core/memory
     • No frustums required
       • Good for secondary rays, bad for primary rays
       • Can use rasterization for primary rays
     • Implicit SIMD – easy shader programming
     • Running on a GPU – shading “for free”

  16. Dynamic Scenes
     • Update parts of / the whole BVH and geometry on the GPU
     • Use the GPU for ray tracing and the CPU for BVH construction / refitting
     • Construct the BVH using binning (a simplified sketch follows below)
       • Similar to Wald RT07 / Popov RT06
       • Bin all 3 dimensions using SIMD
       • Results in > 10% better trees
         • Measured as SAH quality, not FPS
       • The speed loss is almost negligible
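     A simplified, hedged host-side sketch of the binning step: primitive bounding boxes are binned by centroid along one axis and the SAH cost of every bin boundary is evaluated with two sweeps. The paper bins all three dimensions at once using SIMD; this scalar version, and every name in it, is illustrative only.

     #include <algorithm>
     #include <cfloat>
     #include <vector>

     struct AABB {
         float mn[3] = {  FLT_MAX,  FLT_MAX,  FLT_MAX };
         float mx[3] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };
         void grow(const AABB& b) {
             for (int a = 0; a < 3; ++a) {
                 mn[a] = std::min(mn[a], b.mn[a]);
                 mx[a] = std::max(mx[a], b.mx[a]);
             }
         }
         float area() const {
             float d[3] = { mx[0] - mn[0], mx[1] - mn[1], mx[2] - mn[2] };
             return 2.0f * (d[0] * d[1] + d[1] * d[2] + d[2] * d[0]);
         }
     };

     const int NUM_BINS = 16;

     // SAH cost of the best split plane along 'axis' for the given primitive boxes;
     // cmin/cmax bound the primitive centroids of the current BVH node.
     float bestSplitSAH(const std::vector<AABB>& prims, int axis, float cmin, float cmax)
     {
         AABB binBox[NUM_BINS];
         int  binCnt[NUM_BINS] = {};
         const float scale = NUM_BINS / std::max(cmax - cmin, 1e-6f);

         for (const AABB& p : prims) {                         // bin by centroid
             float c = 0.5f * (p.mn[axis] + p.mx[axis]);
             int   b = std::min(NUM_BINS - 1, std::max(0, int((c - cmin) * scale)));
             binBox[b].grow(p);
             ++binCnt[b];
         }

         // Right-to-left sweep: suffix areas and counts for every split plane.
         float rightArea[NUM_BINS];
         int   rightCnt[NUM_BINS];
         AABB acc; int cnt = 0;
         for (int b = NUM_BINS - 1; b > 0; --b) {
             acc.grow(binBox[b]); cnt += binCnt[b];
             rightArea[b] = acc.area(); rightCnt[b] = cnt;
         }

         // Left-to-right sweep: cost = A_L * N_L + A_R * N_R at each plane.
         float best = FLT_MAX;
         acc = AABB(); cnt = 0;
         for (int b = 0; b + 1 < NUM_BINS; ++b) {
             acc.grow(binBox[b]); cnt += binCnt[b];
             if (cnt == 0 || rightCnt[b + 1] == 0) continue;   // skip empty sides
             best = std::min(best, acc.area() * cnt + rightArea[b + 1] * rightCnt[b + 1]);
         }
         return best;
     }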

  17. Results

  18. Conclusions
     • New recursive PRAM BVH traversal algorithm
       • Very well suited to the new generation of GPUs
       • No additional pre-computed data required
     • First GPU ray tracer to handle large models
       • Previous implementations were limited to < 300K triangles
     • Can handle dynamic scenes
       • By using the CPU to update the geometry / BVH

  19. Future Work
     • More features
       • Shaders, adaptive anti-aliasing, …
       • Global illumination
     • Code optimizations
       • The current implementation uses too many registers

  20. Thank you!

  21. CUDA Hello World

     #include <cstdio>
     #include <cstdlib>
     #include <cuda_runtime.h>

     const int N = 4194304;                        // number of elements per array

     // Element-wise sum of two arrays, one element per thread.
     __global__ void addArrays(int *arr1, int *arr2)
     {
         unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
         arr1[t] += arr2[t];
     }

     int main()
     {
         int *inArr1 = (int*)malloc(N * sizeof(int));
         int *inArr2 = (int*)malloc(N * sizeof(int));
         int *ta1, *ta2;
         cudaMalloc((void**)&ta1, N * sizeof(int));
         cudaMalloc((void**)&ta2, N * sizeof(int));
         for (int i = 0; i < N; i++) { inArr1[i] = rand(); inArr2[i] = rand(); }
         cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
         cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);
         addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);   // N/512 blocks of 512 threads
         cudaMemcpy(inArr1, ta1, N * sizeof(int), cudaMemcpyDeviceToHost);
         for (int i = 0; i < N; i++) printf("%d ", inArr1[i]);
         return 0;
     }
