Introduction to Realtime Ray Tracing Course 41

Introduction to Realtime Ray Tracing, Course 41. Philipp Slusallek, Peter Shirley, Bill Mark, Gordon Stoll, Ingo Wald. Hardware for Realtime Ray Tracing: custom hardware for realtime ray tracing; characteristics and requirements; RPU design and implementation.

Presentation Transcript


  1. Introduction to Realtime Ray Tracing, Course 41 • Philipp Slusallek, Peter Shirley, Bill Mark, Gordon Stoll, Ingo Wald

  2. Hardware for Realtime Ray Tracing • Custom Hardware for Realtime Ray Tracing • Characteristics and requirements • RPU Design and Implementation • GPU + Recursion + Custom Traversal HW • Programming Model • FPGA Prototype • Performance and Scalability

  3. Ray Tracing on CPUs • Characteristics • Commodity, well understood HW • High FP performance, yet still too slow • Limited parallelism, bulky clusters • Poor silicon usage (e.g. cache) • Outlook • Multi-core designs are coming • Will still take too long

  4. Ray Tracing on GPUs • Characteristics • Very high raw FP performance • High degree of parallelism • Fast development cycle • Stream programming model • Still too limited for efficient ray tracing • No support for recursion • Limited memory access

  5. Ray Tracing Characteristics: kd-Tree Traversal • One-dimensional computation along ray • Compute location of d relative to t_min / t_max • Iterate or recurse with updated t_min / t_max • Near: t_min < t_max < d • Both: t_min < d < t_max • Far: d < t_min < t_max [Figure: three rays crossing a split plane at distance d, one for each case.]

  6. Ray Tracing Characteristics: kd-Tree Traversal • Inner traversal loop:
       tmp = node.split - ray.origin
       d = tmp * 1/ray.direction
       near = d > t_min
       far = d < t_max
       if (near & far) push(node.far, d, t_max)
       if (near) iterate(node.near, t_min, d)
       else iterate(node.far, d, t_max)
     • Advantages of using kd-trees • Simple and fast traversal & building algorithm • Robust & very good handling of large scenes [Figure: ray segment from t_min to t_max crossing the split plane at d.]
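
The inner loop above can be sketched in software as a stack-based traversal. This is a minimal illustration, not the TPU's actual logic: the `Node` layout, the `traverse` function, and the rule that picks the near child from the side of the plane the ray origin lies on are all assumptions made for this example. Unlike the slide's condensed pseudocode, it leaves the interval unchanged when only one child is visited.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical kd-tree node: inner nodes split, leaves carry primitives."""
    axis: int = 0                        # split axis (0=x, 1=y, 2=z)
    split: float = 0.0                   # split plane position on that axis
    left: Optional["Node"] = None        # child below the split plane
    right: Optional["Node"] = None       # child above the split plane
    prims: Optional[List[str]] = None    # set only on leaf nodes


def traverse(root: Node, origin: Sequence[float], direction: Sequence[float],
             t_min: float, t_max: float) -> List[List[str]]:
    """Return the primitive lists of all leaves the ray visits, front to back."""
    visited: List[List[str]] = []
    stack = []                           # deferred far children: (node, t_min, t_max)
    node = root
    while True:
        while node.prims is None:        # descend until a leaf is reached
            o, dr = origin[node.axis], direction[node.axis]
            below = o < node.split or (o == node.split and dr <= 0)
            near, far = (node.left, node.right) if below else (node.right, node.left)
            if dr == 0.0:                # ray parallel to plane: never crosses it
                node = near
                continue
            d = (node.split - o) / dr    # ray parameter at the split plane
            if d <= 0.0 or d >= t_max:   # "Near" case: plane not reached in interval
                node = near
            elif d <= t_min:             # "Far" case: plane already passed
                node = far
            else:                        # "Both" case: near first, far deferred
                stack.append((far, d, t_max))
                node, t_max = near, d
        visited.append(node.prims)
        if not stack:
            return visited
        node, t_min, t_max = stack.pop()
```

A real ray tracer would intersect the primitives of each leaf and stop at the first hit inside the leaf's interval; the sketch only records the visiting order.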

  7. Ray Tracing Characteristics: kd-Tree Traversal • Traversal Processing • 50-80 kd-tree steps per ray @ 10 instructions/step: many instructions → many clock cycles • Serial dependency → low pipeline efficiency, stalls, latency • Limited but flexible control flow and memory access → Custom HW unit • One clock tick per traversal step (fully pipelined) • Up to 100:1 improvement

  8. Ray Tracing Characteristics: Intersection • Intersection computation • Triggered by traversal at every leaf node • Called with: ray and address of geometry • Option 1: Custom hardware [SaarCOR’05] • Option 2: Software on programmable processor • Can be implemented efficiently • Enables arbitrary programmable primitives → Do not use costly dedicated hardware

  9. Ray Tracing Characteristics: Shading • Shading computation • Triggered by finished ray traversal • Called with: ray, hit point, shader-id, address of parameters • Characteristics: • General-purpose computation, many 3-/4-vectors • Needs support for efficient texture and memory access • Needs support for arbitrary recursive tracing of rays • E.g. support dependent ray tracing → Main feature of ray tracing: do not put limits on it

  10. Ray Tracing Characteristics: Coherence • Ray coherence • Neighboring primary rays • Traverse highly similar kd-tree nodes in the same order • Often hit the same geometric primitives • Often execute the same shader, access the same textures, … • Similar for shadow rays to one light source • Often (but not always) applies to secondary rays → HW should take advantage of this coherence

  11. Previous Work • SaarCOR I • Fixed function ray tracing chip [GH’05]

  12. RPU Approach • Take GPUs as basis and core component • Highly parallel, highly efficient • Improve programming model • Add efficient recursion, conditionals • Add memory access options • Add custom traversal unit • Slave to RPU • Performs indirect, data-dependent function calls

  13. RPU Design • Shader Processing Units (SPU) • General purpose computation • For shading, geometry, lighting computations • Operates on 4-component vectors • Integer and float • Dual issue, split vector • GPU-like instruction set • Arbitrary read/write • Texture addressing mode • No texture filtering → done in SW

  14. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Efficient traversal of k-D trees • Communicates with SPU over dedicated registers

  15. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Increases usage of HW resources • Hides latency due to • Memory access • Instruction dependencies • Long traversal operations • Separate thread pool for SPU & TPU • Software scheduling (compiler) • No overhead for switching threads • Increases resources (mainly register file)

  16. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • SIMD execution (SPUs & TPUs) • Takes advantage of coherence • Reduces hardware complexity • Can combine memory requests • Reduces external bandwidth • Must allow for incoherence • Chunks may split at conditionals • Inactive sub-chunk put on stack • Masked execution • Worst case: serial computation
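
The chunk-splitting behaviour described on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions: `run_conditional`, its arguments, and the pass counter are hypothetical names, and real hardware would evaluate all lanes and mask the writes rather than skip inactive lanes.

```python
from typing import Callable, List, Tuple


def run_conditional(chunk: List[int],
                    cond: Callable[[int], bool],
                    then_fn: Callable[[int], int],
                    else_fn: Callable[[int], int]) -> Tuple[List[int], int]:
    """Execute `if cond: then_fn else: else_fn` for every lane of a chunk.

    Returns the per-lane results and the number of masked passes needed.
    """
    taken = [cond(v) for v in chunk]
    pending = []                                  # stack of (mask, function) sub-chunks
    if not all(taken):
        # Lanes that fail the condition form the inactive sub-chunk.
        pending.append(([not t for t in taken], else_fn))
    if any(taken):
        pending.append((taken, then_fn))          # active sub-chunk runs first
    result = list(chunk)
    passes = 0
    while pending:
        mask, fn = pending.pop()
        passes += 1
        # Masked execution: only lanes with mask=True are updated.
        result = [fn(v) if m else v for v, m in zip(result, mask)]
    return result, passes
```

A coherent chunk needs one pass; a chunk whose lanes disagree needs two, which is where the cost of incoherence comes from. With nested conditionals the number of passes can grow up to the chunk size, the serial worst case named on the slide.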

  17. RPU Design • Shader Processing Units (SPU) • Custom Ray Traversal Unit (TPU) • Multi-Threading • Chunking • Mailbox Processing (MPU) • Per thread caching mechanism • Avoids multiple processing of same kd-tree entry (e.g. triangle) • 10x performance for some scenes
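
Mailboxing can be sketched as a small per-ray cache of primitive IDs that have already been tested. The slide does not say how the MPU is organised, so the direct-mapped layout, the slot count of 4, and the names `Mailbox` and `count_tests` are assumptions made purely for illustration.

```python
from typing import List, Optional


class Mailbox:
    """Hypothetical direct-mapped cache of recently tested primitive IDs."""

    def __init__(self, slots: int = 4) -> None:
        self.slots: List[Optional[int]] = [None] * slots

    def seen(self, prim_id: int) -> bool:
        """Return True if prim_id was tested recently; otherwise record it."""
        i = prim_id % len(self.slots)
        if self.slots[i] == prim_id:
            return True
        self.slots[i] = prim_id      # may evict another ID mapped to the same slot
        return False


def count_tests(leaves: List[List[int]], mailbox: Optional[Mailbox] = None) -> int:
    """Count intersection tests for one ray visiting the given leaves in order."""
    tests = 0
    for leaf in leaves:
        for prim_id in leaf:
            if mailbox is not None and mailbox.seen(prim_id):
                continue             # already tested for this ray, skip
            tests += 1
    return tests
```

Because a kd-tree stores a primitive in every leaf it overlaps, a ray that visits several leaves may meet the same triangle repeatedly; the cache removes those repeats. An eviction only costs a redundant test, never a missed one.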

  18. RPU Architecture

  19. SPU Vector Registers • All registers have 4 components (float or integer) • R0 to R15: General registers • Index into a HW managed register stack • Allows for single-cycle function call • P0 to P15: shader parameters • I0 to I3: data read from memory • A = (A0,A1,A2,A3) • Memory addressing • ORG, DIR, ... • TPU communication registers

  20. Instruction Set of SPU • Short vector instruction set • mov, add, mul, mad, frac • dph2, dp3, dph3, dp4 • Input modifiers • Swizzling, negation, masking • Multiply with power of 2 • Special operations (modifiers) • rcp, rsq, sat • Fast 2D texture lookups • texload, texload4x • Read from and write to memory • load, load4x, store • Ray traversal operation • trace • Conditional instructions (paired) • if <condition> jmp <label> • if <condition> call <fun> • if <condition> return • Dual issue (pairing) • 3/1 and 2/2 arithmetic splitting • Arithmetic + load • Arithmetic + conditional jump, call, return

  26. Ray Triangle Intersection: Unit-Triangle Test • Slide annotations: input; arithmetic (dot products); multi-issue (arith. & cond.)
        ; load triangle transformation
        load4x A.y,0
        ; transform ray
        dp3_rcp R7.z,I2,R3
        dp3 R7.y,I1,R3
        dp3 R7.x,I0,R3
        dph3 R6.x,I0,R2
        dph3 R6.y,I1,R2
        dph3 R6.z,I2,R2
        ; compute hit distance
        mul R8.z,-R6.z,S.z + if z <0 return
        ; barycentric coordinates
        mad R8.xy,R8.z,R7,R6 + if or xy (<0 or >=1) return
        ; hit if u + v < 1
        add R8.w,R8.x,R8.y + if w >=1 return
        ; hit distance closer than last one?
        add R8.w,R8.z,-R4.z + if w >=0 return
        ; save hit information
        mov SID,I3.x + mov MAX,R8.z
        mov R4.xyz,R8 + return
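
For readers unfamiliar with the unit-triangle test, the same sequence of steps can be written out in Python. This is a sketch based on one reading of the listing, not a verified translation: it assumes `I0` to `I2` hold the rows of an affine transform into a space where the triangle is (0,0,0), (1,0,0), (0,1,0), that `R2` and `R3` hold the ray origin and direction, and that `R4.z` holds the closest hit distance so far. The function name and the explicit parallel-ray check are additions for this example.

```python
from typing import Optional, Sequence, Tuple

Row = Tuple[float, float, float, float]


def unit_triangle_hit(rows: Sequence[Row],
                      origin: Sequence[float],
                      direction: Sequence[float],
                      t_best: float) -> Optional[Tuple[float, float, float]]:
    """Return (t, u, v) for a hit closer than t_best, else None."""

    def dp3(r: Row, v: Sequence[float]) -> float:
        return r[0] * v[0] + r[1] * v[1] + r[2] * v[2]

    def dph3(r: Row, v: Sequence[float]) -> float:
        return dp3(r, v) + r[3]          # homogeneous dot product: v treated as (x, y, z, 1)

    d = [dp3(r, direction) for r in rows]    # transformed direction
    o = [dph3(r, origin) for r in rows]      # transformed origin
    if d[2] == 0.0:
        return None                      # ray parallel to the triangle plane (added check)
    t = -o[2] / d[2]                     # hit distance to the plane z = 0
    if t < 0.0:
        return None
    u = o[0] + t * d[0]                  # barycentric coordinates
    v = o[1] + t * d[1]
    if u < 0.0 or u >= 1.0 or v < 0.0 or v >= 1.0:
        return None
    if u + v >= 1.0:
        return None                      # outside the unit triangle
    if t >= t_best:
        return None                      # an earlier hit is closer
    return (t, u, v)
```

With the identity transform, a ray from (0.25, 0.25, 1) pointing down the negative z axis hits at distance 1 with u = v = 0.25. Storing one precomputed transform per triangle is what lets the test reduce to six dot products and a few comparisons.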

  27. Shader Processing Unit: Pipelining [Figure: SPU pipeline diagram. Stages shown: read instruction; read 3 source registers; swizzling; memory access; multiply and add units; thread control, clamp, branching, RCP/RSQ, masking, stack control; writeback to I0 – I3; masked writeback. Example instructions shown: mov R0,R1; mov R2,R3; mov R0,R2.]

  28. RPU Programming Model • ↨: Direct function calls • ↔: Indirect function calls via TPU [Figure: call graph distinguishing SPU processing from TPU/MPU processing. Nodes: Primary Ray Shader, Top-Level Object Intersector, Geometry Intersector, Surface/BRDF Shader, Lighting Shader, Light Source Shaders. Primary rays, secondary rays and shadow rays pass through TPU/MPU units.]

  30. RPU Programming Model • Threads are started for each pixel • Registers initialized from an input stream • 2D Hilbert curve generator sampling the screen • Memory stream for multi-pass • Shader computes ray
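
A 2D Hilbert curve visits pixels so that consecutive samples are always direct neighbours, which plausibly keeps the rays of a chunk coherent. The slides do not describe the generator itself; the sketch below uses the widely known iterative index-to-coordinate conversion, and the function name `hilbert_d2xy` and its `order` parameter are chosen for this example.

```python
from typing import Tuple


def hilbert_d2xy(order: int, d: int) -> Tuple[int, int]:
    """Map index d on a Hilbert curve covering a 2^order x 2^order grid to (x, y)."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                  # rotate the quadrant so the curve stays connected
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Enumerating `hilbert_d2xy(order, d)` for d from 0 to 4^order - 1 yields every pixel of the grid exactly once. A non-square screen such as 512x384 would need tiling or clipping on top of this, which the sketch does not cover.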

  31. RPU Programming Model • Threads are started • Registers initialized from an input stream • 2D Hilbert curve generator sampling the screen • Memory stream for multi-pass

  32. RPU Programming Model • Shooting Primary Rays • Ray traversal performed on the TPU • Started in top-level kd-tree • Intersector transforms ray into local coordinate system

  35. RPU Programming Model • Shooting Primary Rays (II) • Transformed ray traversed through object kd-tree on TPU • Geometry intersection performed on programmable SPU • Programmable geometry: triangles, spheres, bicubic splines, quadrics, …

  37. RPU Programming Model • Surface shading performed on programmable SPU • Surface shader is called directly from primary shader • Arguments passed on HW stack • May trace secondary rays at any time: reflection, refraction, … • Writing shaders is easy due to global access to the scene and physically-based computation

  38. RPU Programming Model • Light properties and illumination can be abstracted using function calls • Illumination shader iterates over all light sources • For each light source a light source shader is called
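
The call structure on this slide, an illumination shader that loops over lights and calls one light source shader per light, might look as follows in software. Everything here is illustrative: `point_light`, `illumination`, the `occluded` callback standing in for a shadow ray, and the simple cosine-weighted diffuse term are assumptions, not the course's shader code.

```python
import math
from typing import Callable, List, Sequence, Tuple

LightSample = Tuple[List[float], float, float]   # (direction, distance, radiance)
LightShader = Callable[[Sequence[float]], LightSample]


def point_light(position: Sequence[float], intensity: float) -> LightShader:
    """Build a hypothetical light source shader for an isotropic point light."""
    def shade(hit_point: Sequence[float]) -> LightSample:
        to_light = [p - h for p, h in zip(position, hit_point)]
        dist = math.sqrt(sum(c * c for c in to_light))
        direction = [c / dist for c in to_light]
        return direction, dist, intensity / (dist * dist)
    return shade


def illumination(hit_point: Sequence[float],
                 normal: Sequence[float],
                 lights: Sequence[LightShader],
                 occluded: Callable[[Sequence[float], Sequence[float], float], bool]) -> float:
    """Sum the diffuse contribution of every unoccluded light at hit_point."""
    total = 0.0
    for light in lights:                          # one light source shader call per light
        direction, dist, radiance = light(hit_point)
        cos_theta = sum(n * d for n, d in zip(normal, direction))
        if cos_theta <= 0.0:
            continue                              # light is behind the surface
        if occluded(hit_point, direction, dist):  # stands in for tracing a shadow ray
            continue
        total += radiance * cos_theta
    return total
```

Because the surface shader only sees the `(direction, distance, radiance)` triple, new light types can be added without touching it, which is the abstraction the slide refers to.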

  39. Prototype Implementation

  40. Prototype Performance • FPGA prototype • Xilinx Virtex II 6000 • 128 MB DDR-RAM at 350 MB/s • PCI bus for up-/download (no VGA) • Single RPU at only 66 MHz • Up to 4 million rays per second • Up to 20 fps @ 512x384 • Same ray tracing performance as Intel P4 @ 2.66 GHz
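
The quoted figures are consistent with each other, as a quick back-of-the-envelope check shows. The check assumes exactly one primary ray per pixel and no secondary or shadow rays, which the slide does not state; the variable names are chosen for this example.

```python
# Figures taken from the slide.
width, height = 512, 384
frames_per_second = 20
clock_hz = 66_000_000
peak_rays_per_second = 4_000_000

# Assumption: one primary ray per pixel, nothing else.
rays_per_frame = width * height
primary_rays_per_second = rays_per_frame * frames_per_second

# Average clock cycles available per ray at the quoted peak rate.
cycles_per_ray = clock_hz / peak_rays_per_second

print(rays_per_frame)            # 196608
print(primary_rays_per_second)   # 3932160
print(cycles_per_ray)            # 16.5
```

So 20 fps at 512x384 needs about 3.9 million primary rays per second, just under the quoted peak of 4 million, and at 66 MHz that leaves about 16.5 clock cycles per ray on average.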

  41. Scalability • Larger Chunk Size • Less ray coherence • More data is accessed • Increased cache bandwidth • Larger caches

  42. Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Limited by • VLSI technology • Memory bandwidth • Current GPUs versus the FPGA prototype • Floating point units: 50x more • Memory bandwidth: 100x higher • Clock rate: 7x higher

  43. Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Multiple chips on a board • Fast interconnect for data exchange • Cache sizes accumulate • Managed through virtual memory [Schmittler’2003] • Limited through external bandwidth due to scene changes

  44. Scalability • Larger Chunk Size • Multiple RPUs on a Chip • Multiple chips on a board • Multiple boards in a PC • Similar to today’s PC clusters in a much smaller form factor

  45. Video

  46. Future Work • Support for fully dynamic scenes • Vertex shader + building kd-trees • Efficient photon mapping • kd-tree construction + kNN filtering • OpenRT-API [Dietrich’03] • ASIC prototype

  47. Questions? http://graphics.cs.uni-sb.de http://www.OpenRT.de http://www.SaarCOR.de
