Afrigraph 2003 Course on Advanced Interactive Ray Tracing and Interactive Global Illumination

Ingo Wald, Carsten Benthin, Philipp Slusallek
Saarland University

First: What is Ray Tracing?
Ray Generation → Ray Traversal → Intersection → Shading → Framebuffer
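
The five stages named on this slide can be sketched as a minimal ray caster. This is only an illustration, not the course's implementation: the single-sphere scene, the pinhole camera at the origin, and all function names (`generate_ray`, `intersect_sphere`, `shade`, `render`) are assumptions, and the traversal stage is reduced to a loop over a one-object scene.

```python
import math

def generate_ray(x, y, width, height):
    """Ray generation: map pixel (x, y) to a unit direction from a camera at the origin looking down -z."""
    px = (x + 0.5) / width * 2.0 - 1.0
    py = 1.0 - (y + 0.5) / height * 2.0
    length = math.sqrt(px * px + py * py + 1.0)
    return (0.0, 0.0, 0.0), (px / length, py / length, -1.0 / length)

def intersect_sphere(origin, direction, center, radius):
    """Intersection: nearest positive hit distance along a unit-length direction, or None."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0.0 else None

def shade(origin, direction, t, center, radius, light_dir):
    """Shading: diffuse term from the surface normal and a directional light."""
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    normal = tuple((h - c) / radius for h, c in zip(hit, center))
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))

def render(width, height):
    scene = [((0.0, 0.0, -3.0), 1.0)]  # hypothetical scene: one sphere (center, radius)
    light_dir = (0.0, 0.0, 1.0)        # unit vector from the surface towards the light
    framebuffer = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            origin, direction = generate_ray(x, y, width, height)
            # "Traversal": visit every object; a real tracer would walk a kd-tree here.
            nearest = None
            for center, radius in scene:
                t = intersect_sphere(origin, direction, center, radius)
                if t is not None and (nearest is None or t < nearest[0]):
                    nearest = (t, center, radius)
            if nearest is not None:
                framebuffer[y][x] = shade(origin, direction, *nearest, light_dir)
    return framebuffer
```

Calling `render(9, 9)` returns a 9x9 grid of brightness values in which only the pixels whose rays hit the sphere are non-zero.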

Agenda
• Introduction & Motivation
• Why Interactive Ray Tracing at all?
• Part I – Interactive Ray Tracing Architectures
• Software Ray Tracing
• Ray Tracing on Programmable GPUs
• Dedicated Ray Tracing Hardware
• Part II – Advanced Ray Tracing Issues
• Handling Dynamic Scenes
• The OpenRT Interactive Ray Tracing API
• Part III – New Applications
• Industrial Application: Interactive Visualization of Car Headlights
• Interactive Global Illumination
• Summary and Conclusions

Why Interactive Ray Tracing ?

We have NVidia – so what do we need Ray Tracing for?
• Because it is high quality…
• Fully programmable and arbitrary shading operations
• All operations performed in floating point
• Flexibility: Can shoot arbitrary rays
• Shadows, reflections, refractions, …
• Even suitable for global illumination
• Simple programming model
• No need for multiple passes or OpenGL ‘tricks’
• For indirect effects (like shadows): just shoot a ray!
• Automatic ‘correctness’
• No need for approximations (like reflection maps)
→ Ray tracing is a much more flexible and powerful rendering algorithm than ‘classical’ triangle rasterization

We have NVidia – so what do we need Ray Tracing for?
• But not only that: It’s also efficient!
• Cost grows only logarithmically with scene complexity
• Useful for increasingly complex scenes (“1 Mtri, no problem!” …)
• No multiple rendering passes
• ‘Automatic’ visibility culling & occlusion culling
• Hidden geometry not even touched …
• Depth complexity not an issue
• No overdraw, shading performed exactly once per ray
• Very useful for increasingly costly shading
• Small bandwidth requirements (if you do it right…)
• Memory access coherence + culling + single shading + …

We have NVidia – so what do we need Ray Tracing for?
To summarize:
• … it’s highly flexible
• … it’s high-quality
• … it’s efficient
• And: All of that combines automatically
• Rasterization hardware can do some of this some of the time, but usually not all of it together

“If it’s so good, then why isn’t it real?”
• 1.) Better asymptotic complexity, but huge constants
• 1 ray ~ 1000 CPU cycles
• Runs on hardware that it doesn’t really fit…
• Uses only a tiny fraction of today’s CPUs, no parallelism, …
• Need many rays/sec for full interactivity
• ~ 1 Mpix/frame * 4-fold antialiasing * 25 frames/sec * 10 rays/pixel → one billion rays per second …
• 2.) Graphics users don’t have the choice
• Rasterization has highly sophisticated HW implementations → HW technology for rasterization is 10 years ahead of RT HW…
• There is no interactive ray tracing chip (yet), no matter the cost…
• All applications are designed for OpenGL → There is no market for interactive ray tracing (really?)
• Still more money/time/effort spent on improving rasterization
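
The ray budget on this slide is a plain product, and the cost of about 1000 cycles per ray turns it into a rough CPU count. The sketch below only restates that arithmetic; the function names and the 1.5 GHz clock rate are assumptions chosen for illustration.

```python
def required_rays_per_second(pixels_per_frame, samples_per_pixel, frames_per_second, rays_per_sample):
    """Rays per second needed for full interactivity."""
    return pixels_per_frame * samples_per_pixel * frames_per_second * rays_per_sample

def rays_per_second_per_cpu(clock_hz, cycles_per_ray):
    """Throughput of one CPU if every ray costs a fixed number of cycles."""
    return clock_hz / cycles_per_ray

budget = required_rays_per_second(1_000_000, 4, 25, 10)  # figures from the slide
per_cpu = rays_per_second_per_cpu(1.5e9, 1000)           # 1.5 GHz is an assumed clock rate
cpus_needed = budget / per_cpu
print(f"{budget:.1e} rays/s needed, {per_cpu:.1e} rays/s per CPU, about {cpus_needed:.0f} CPUs")
```

With these inputs the budget is one billion rays per second, while one such CPU delivers 1.5 million rays per second, so several hundred CPUs would be needed.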

Why is there no Ray Tracing Hardware?
Because graphics hardware evolved 20 years ago!
• And: Rasterization was the better choice back then…
• Small scenes → (asymptotic) complexity doesn’t matter for small N
• Large triangles
• Coherence: incremental ops & interpolation, low bandwidth
• Simple (integer) operations, highly pipelined
• FPU requirements of ray tracing unthinkable 10 years ago…
• No fragment ops except interpolation
• Programmability not an issue → Very deep pipelines: no dependencies, no branches, no nothing, …
• Can be built in HW very efficiently, very fast, very cheaply
• Note: All of this is changing today!
• E.g. today, a GeForce 3 already has more FPU power than any CPU…

Today’s State of the Art in Realtime Ray Tracing
Software implementations are slowly becoming available
• Michael Muuss, Army Research Labs
• Huge cluster of SGI machines…
• Parker et al., University of Utah
• 32-128 CPU SGI Origin
• Saarland University
• 4 dual PIIIs in 2000, up to 24 dual Athlon 1800+ today
Hardware architectures are already being designed
• SaarCOR (Schmittler et al., HWWS 2002)
• Ray Tracing on Programmable GPUs (Purcell, SIGGRAPH 2002)
• Hybrid Software/GPU system (Hart, HWWS 2002)
• Several alternatives for future realtime ray tracing
• Can’t yet decide which is best, only know: “It’ll come”

Today’s State of the Art in Realtime Ray Tracing
• Even today, interactive ray tracing (IRT) solves tasks that even high-end graphics hardware still cannot handle!
• Highly complex models (Muuss, Utah, Saarland [RW2001])
• High-quality isosurface and volume visualization (Utah)
• Shadows, reflections, arbitrary shading… (Saarland, Utah)
• High-quality reflection simulation of car headlights [PGV2002]
• Interactive global illumination [RW2002]

Today’s State of the Art: Some Snapshots

Video

Part I – Different Approaches to Realtime Ray Tracing

Different Approaches to Realtime Ray Tracing
Basically three choices:
• Pure software implementations
• Today: Highly parallel
• Shared memory (Utah), or PC clusters (Saarland)
• Future: Single PC?
• Moore’s Law also holds for CPUs!
• Perhaps with streaming co-processors (e.g. “SSE++”)
• Mixed SW/HW: RT on programmable GPUs
• Purcell et al., Stanford
• Converges to the ‘coprocessor’ approach
• Pure HW
• Dedicated RT hardware (Schmittler et al., SaarCOR)
• The following parts summarize all three approaches

Alternative I – Software Ray Tracing (exemplified by the Saarland engine)

The OpenRT Interactive Ray Tracing Engine
Features of OpenRT:
• Highly efficient implementation of RT kernels
• On a single Athlon MP 1800+ CPU: ~ 500,000 to 1.5 million rays per second for average models (100 ktri – 1 Mtri)
• Up to the 10 million rays/sec (rps) range (no shading, simple scenes)
• Sophisticated parallelization on a cluster of PCs
• Dynamic load balancing
• Using up to 24 dual Athlon MP 1800+ or 25 dual P4 Xeon 2.4 GHz
• Dynamically loadable, fully programmable shaders
• Arbitrary C-code shading, arbitrary rays
• RenderMan-like shading language
• Can handle dynamic scenes (later)
• OpenGL-like API (later)

Where does the speed come from?
Speed depends on several factors…
• Using the fastest available hardware
• Fast CPUs, and many CPUs
• Good algorithms – avoid operations in the first place
• Fast intersection and traversal (kd-trees)
• Minimize intersections and traversal steps with high-quality BSPs
• Just as important – make sure you’re using your silicon correctly!
• Highly efficient implementation
• Machine-dependent code, if necessary (SSE)

Where does the speed come from?
Keep the computational units busy!
• Make sure the CPU doesn’t stall
• Avoiding pipeline stalls has top priority
• Look at memory, caches and bandwidth!
• Example: A cache miss during triangle intersection costs about 4 times as much as the computations themselves!
• Packing, aligning, cache-friendly data layout, prefetching, …
• But: no details here
• Already covered that at Afrigraph 2001
• It’s not one single method, it’s more a principle

Distributed Ray Tracing
• One CPU is still not fast enough
• 1 Mray/sec is fast, but not enough
• Need more CPUs → Clusters are cheap ($20k-$50k)
• Many approaches:
• Static vs dynamic load balancing
• Object-space vs image-space vs ray-based task partitioning, …
• Pixel-interleaved (load balancing) vs tiles (coherence)
• …
• Problem: Interactivity constraint
• Have to finish the whole frame in 1/10th of a second
• Little time for sophisticated reordering/scheduling

Distributed Ray Tracing
Our approach (mostly Carsten Benthin)
• Image-based task partitioning → Break image up into ‘tiles’ (usually 16x16 or 32x32)
• Since API: Can dynamically change task partitioning scheme
• Strongly varying workload → Need dynamic load balancing: Let clients ask for work …
• Have to care about network latencies
• (10 ms network latency = 10,000 rays!)
• Highly efficient networking/communication code
• Double-buffering, prefetching, packing, streaming, asynchronous sending and rendering, interleaving of different tasks, multithreading, …
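
The "let clients ask for work" scheme can be sketched with a shared tile queue. This is a single-process illustration under stated assumptions: threads stand in for cluster nodes, `render_tile` is a hypothetical placeholder for the actual ray tracing, and none of the networking optimizations listed on the slide are modeled.

```python
import queue
import threading

def make_tiles(width, height, tile_size):
    """Split the image into (x, y, w, h) tiles; border tiles may be smaller."""
    return [(x, y, min(tile_size, width - x), min(tile_size, height - y))
            for y in range(0, height, tile_size)
            for x in range(0, width, tile_size)]

def render_tile(tile):
    """Hypothetical stand-in for tracing all rays of one tile; returns its pixel count."""
    _, _, w, h = tile
    return w * h

def client(work, results, lock):
    """A client keeps asking for the next tile until none are left."""
    while True:
        try:
            tile = work.get_nowait()
        except queue.Empty:
            return
        pixels = render_tile(tile)
        with lock:
            results.append((tile, pixels))

def render_frame(width, height, tile_size, num_clients):
    """Fill the queue with tiles, start the clients, and collect their results."""
    work = queue.Queue()
    for tile in make_tiles(width, height, tile_size):
        work.put(tile)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=client, args=(work, results, lock))
               for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each client pulls a new tile as soon as it finishes the previous one, expensive tiles do not hold up the other clients, which is the point of dynamic over static assignment.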

Distributed Ray Tracing: Results
• Can efficiently use many CPUs
• 32x32 tiles at 640x480 = 300 tiles → enough for many CPUs
• Usually limiting factor: Pixels/second (not rays/sec)
• Bandwidth limited at server: 640x480 at 10-15 frames/sec
• For < 10 fps: Usually achieve 90-99% client utilization
• Client bandwidth usually not an issue … (100 Mbit)
• Rendering complexity helps!
• More costly tiles = better compute/BW ratio, fewer pixels/sec
• Can use more CPUs without hitting bandwidth limit
• Doubling rays/pixel is easier than doubling the framerate
• Framerate scales linearly only up to the maximum framerate
• But always scales linearly in rays/pixel
• Better networking hardware would definitely help
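
The server-side limit can be made concrete by estimating the pixel data rate. The sketch below assumes uncompressed 24-bit RGB pixels; the slides state neither the pixel format nor the server's link speed, so both the 3 bytes per pixel and the function name are assumptions.

```python
def pixel_bandwidth_mbit(width, height, fps, bytes_per_pixel=3):
    """Data rate of finished pixels arriving at the server, in Mbit/s."""
    return width * height * bytes_per_pixel * 8 * fps / 1e6

for fps in (10, 15):
    print(fps, "fps:", round(pixel_bandwidth_mbit(640, 480, fps), 1), "Mbit/s")
```

Under that assumption, 640x480 at 10 to 15 frames per second corresponds to roughly 74 to 111 Mbit/s of pixel data, independent of how many rays were traced per pixel.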

Realtime Ray Tracing – Approach II: Ray Tracing on Programmable GPUs

Ray Tracing on Programmable GPUs
Graphics hardware today
• GPUs are extremely powerful
• Already more transistors than a P4
• Full IEEE floating point!
• Many, many, many parallel FPUs
• Moore’s Law: Faster growth than for CPUs
• GPUs become more and more programmable
• First: ‘Register Combiners’
• Then: ‘Vertex Shaders’
• Programmable per vertex
• Linear interpolation between the vertices
• Today: ‘Pixel Shaders’, ‘Fragment Programs’
• Fully programmable for each fragment

Ray Tracing on Programmable GPUs
GPU programmability today:
• Full IEEE
• SIMD computations
• Access to ‘memory’ (textures) in every instruction
• Multiple indirections (pointer chasing) now possible
• “dependent texture reads”
• Still: Several restrictions
• Conditionals, loops, recursion, dependent texture writes …
• Typically programmed in ‘GPU assembler’
• Most recent: High-level ‘meta’ languages
• E.g. ‘Cg’ (‘C for graphics’)

Streaming Computations on Programmable GPUs
Idea: Use GPU as streaming co-processor
• Don’t use it for rasterizing at all…
• Pixels form a ‘stream’ of elements
• Apply a small program (‘kernel’) to the whole stream
• Render screen-aligned quad with a fragment shader
• Fragment program executed for each screen pixel
• Each pixel operates on different data
• Read data from textures
• Screen-aligned textures: 1 texel for each pixel
• Output to framebuffer: 1 ‘pixel’ for each fragment program invocation
• Feedback loop: Copy framebuffer to textures
• Future: Directly write into textures
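
The streaming model with its feedback loop can be mimicked on the CPU. The sketch below is a toy under stated assumptions: nested lists stand in for textures and the framebuffer, and `run_pass` and `increment_kernel` are hypothetical names, not part of any GPU API.

```python
def run_pass(kernel, textures, width, height):
    """Run `kernel` once per pixel, like drawing a screen-aligned quad; return the new framebuffer."""
    return [[kernel(x, y, textures) for x in range(width)] for y in range(height)]

def increment_kernel(x, y, textures):
    """Toy kernel: read the texel that lines up with this pixel and add 1."""
    return textures["state"][y][x] + 1

width, height = 4, 2
textures = {"state": [[0] * width for _ in range(height)]}
for _ in range(3):
    framebuffer = run_pass(increment_kernel, textures, width, height)
    textures["state"] = framebuffer  # feedback: the framebuffer becomes the input texture
```

Each pass reads only from textures and writes exactly one value per pixel, so any state that must survive to the next pass has to go through the feedback copy.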

Ray Tracing on Programmable GPUs
[Figure: a screen-aligned quad drives the fragment kernel (fragment shader), which reads data (texels) from memory (textures) and writes its output to the frame buffer]

Ray Tracing on Programmable GPUs
[Figure: the same data flow from screen-aligned quad, memory (textures) and data (texels) through the fragment kernel (fragment shader) to the frame buffer, now with feedback from the output]

Ray Tracing on Programmable GPUs
Mapping Ray Tracing to the GPU
• Use textures for storing ‘variables’
• Ray: ‘origin’ and ‘direction’ 2D textures (3 floats each)
• Hit: 2D texture (3 floats: u, v, id)
• Vertices: 1D texture of vertex positions (3 floats each)
• Triangles: 1D texture of vertex ids (1 float each)
• Acceleration structure: e.g. 3D texture for simple grid
• Multiple indirections no problem
• E.g. use triangle[i] as texture coordinate into vertex[] texture
• Up to 4 indirections (grid → triangle list → triangle → vertex)
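
The chain of indirections can be illustrated with flat arrays standing in for textures. The concrete layout below is an assumption made for the example: the `-1` list terminator, the dictionary used in place of a 3D grid texture, and the two-triangle scene are all hypothetical.

```python
# Hypothetical scene: two triangles sharing an edge, both inside one grid cell.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]  # 3 floats each
triangles = [0, 1, 2, 1, 3, 2]   # vertex ids, three consecutive entries per triangle
triangle_lists = [0, 1, -1]      # triangle ids per cell; -1 terminates a list (assumed convention)
grid = {(0, 0, 0): 0}            # cell -> start offset into triangle_lists

def triangles_in_cell(cell):
    """Follow grid -> triangle list -> triangle -> vertex and return vertex positions."""
    offset = grid.get(cell)
    if offset is None:
        return []                # empty cell
    result = []
    while triangle_lists[offset] != -1:
        tri = triangle_lists[offset]
        ids = triangles[3 * tri: 3 * tri + 3]
        result.append(tuple(vertices[i] for i in ids))
        offset += 1
    return result
```

Every lookup uses the result of the previous one as its index, which is what a dependent texture read provides on the GPU.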

Ray Tracing on Programmable GPUs
Write ‘kernels’ for different ray tracing ops
• Ray Generation
• Get pixel position from texture coordinates
• Somehow get camera settings (e.g. from quad color, or texture)
• Compute corresponding ray
• Write to ‘origin’, ‘direction’, ‘state’ textures
• Triangle Intersection
• Read ID of the triangle to be intersected from state
• Get triangle vertices from textures
• Intersect
• Update state texture
• Similar for traversal, triangle list intersection, shading, …
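
The "Intersect" step of the triangle kernel could look like the following sketch. The slides do not say which intersection test is used; the Möller-Trumbore test is swapped in here as a common choice, and all names are assumptions. It returns the distance plus the (u, v) coordinates that the hit texture stores alongside the triangle id.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for the hit, or None if the ray misses the triangle."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None              # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return (t, u, v) if t > eps else None
```

A kernel built around this function would fetch `v0`, `v1`, `v2` through the vertex-id lookup and, on a hit, write `(u, v, triangle_id)` to the hit texture.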

Ray Tracing on Programmable GPUs
• Have kernels for ray generation, traversal, intersection, etc.
• Each ray is in exactly one ‘state’
• E.g. in ‘intersection’ state
• Make sure only rays in the ‘correct’ state are processed
• E.g. apply intersection kernel only to rays in intersect state
• Usual GL masking methods, e.g. stencil bits, early pixel kill etc. → Can generate overhead, but usually ok …
• Fragment program can change the state of a ray
• E.g. change from ‘traversal’ to ‘intersection’ in a non-empty voxel
• Combine different kernels by just calling them in turn
• E.g. rendering an ‘intersection’ quad will do one intersection step (but only for rays in intersect state!)
• Secondary rays are relatively easy for the ‘Shader’ kernel
• Update origin & direction textures, go back to ‘traversal’ state…
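
The per-ray state machine can be sketched as a loop of masked passes. Everything concrete here is an assumption made for the example: rays are dictionaries, each ray carries a precomputed list of voxels along its path (an empty string for an empty voxel, `"miss"` or `"hit"` otherwise), and the pass order is fixed.

```python
def traverse(ray):
    """Step to the next voxel; switch to 'intersect' when a non-empty voxel is reached."""
    ray["voxel"] += 1
    if ray["voxel"] >= len(ray["voxels"]):
        ray["state"] = "done"        # left the grid without a hit
    elif ray["voxels"][ray["voxel"]]:
        ray["state"] = "intersect"

def intersect(ray):
    """Toy intersection: a voxel marked 'hit' counts as a hit, anything else as a miss."""
    ray["state"] = "shade" if ray["voxels"][ray["voxel"]] == "hit" else "traverse"

def shade(ray):
    ray["color"] = 1.0
    ray["state"] = "done"

KERNELS = {"traverse": traverse, "intersect": intersect, "shade": shade}

def run_pass(state, rays):
    """One 'quad': apply the kernel for `state` only to rays currently in that state."""
    active = [r for r in rays if r["state"] == state]
    for ray in active:
        KERNELS[state](ray)
    return len(active)

def render(rays):
    """Call the kernels in turn until every ray is done; return the number of passes."""
    passes = 0
    while any(r["state"] != "done" for r in rays):
        for state in KERNELS:
            run_pass(state, rays)
            passes += 1
    return passes

def new_ray(voxels):
    return {"state": "traverse", "voxel": -1, "voxels": voxels, "color": 0.0}
```

The filter in `run_pass` plays the role of the stencil or early-kill masking: a pass costs something even when few rays are in its state, which is where the overhead mentioned on the slide comes from.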

Ray Tracing on Programmable GPUs
Results:
• Easy to exploit parallelism in the GPU
• Many more pixels than fragment pipelines
• Comparable performance to a single CPU
• Even though it’s only a prototype implementation
• Limited by fragment pipeline very soon…
• Main limitations
• Fragment processing speed
• Texture memory
• Need many textures for each pixel
• Also need to store the whole scene in textures
• Bandwidth
• Number of different states must be small!

Ray Tracing on Programmable GPUs
Additional limitations of current GPUs
• Bandwidth problems due to missing loops
• Often have to write data just to save it for the next iteration
• Overhead due to missing ‘write’ capability
• Accuracy problems – no ints, all floats
• E.g. rounding modes when reading IDs from a texture …
• Problems due to missing ‘dependent writes’
• Many textures for input, but only one framebuffer for output
• Need multiple passes when computing more than 3 values per pixel
• Each fragment shader writes to exactly one predetermined position
• Hard to do recursive operations with that limitation
• Kd-tree construction?

Ray Tracing on Programmable GPUs
Ray tracing on GPUs in the future?
• Many limitations will (probably) change
• Loops, branches, dependent writes, int textures, texture memory, early pixel kill …
• Performance will increase faster than for CPUs
→ Might soon be faster than, and similarly flexible as, ray tracing on a CPU!

Realtime Ray Tracing – Approach III: Dedicated Ray Tracing Hardware

Dedicated Ray Tracing Hardware
• Relatively low efficiency when using a GPU for RT
• Many units not needed at all (rasterization, z-buffer, clipping, lighting, …)
• Lots of overhead
• Programmable units can never be as efficient as dedicated HW
• Dedicated ray tracing HW should be more efficient
• Building RT HW is feasible today
• FPU power not a problem any more (see GeForce 3 FPU performance)
• Die size/number of transistors not a problem any more
• Main problem: Off-chip bandwidth!
• Already between chip and cache

Dedicated Ray Tracing Hardware
Bandwidth: Same problem as in SW
• Approach in SW: Bandwidth reduction by Coherent Ray Tracing (packet traversal)
• HW: Much larger packets (64x64 vs 2x2!)
• Much bigger bandwidth saving
• Target realtime full-screen resolutions
• Larger packet sizes not a problem → Lots of coherence
• Avoiding overhead simple in HW
• Much simpler than with SSE
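
Why larger packets save bandwidth can be seen from an idealized count of memory fetches. The sketch assumes a perfectly coherent packet in which every ray visits the same nodes, so each node is fetched once per packet; real packets diverge, so this is an upper bound on the saving. The figure of 50 visited nodes is an arbitrary example value.

```python
def node_fetches_per_ray(nodes_visited, packet_width, packet_height):
    """Ideal case: each visited node is fetched once and shared by every ray in the packet."""
    return nodes_visited / (packet_width * packet_height)

software = node_fetches_per_ray(50, 2, 2)     # 2x2 packets, as in the SSE software tracer
hardware = node_fetches_per_ray(50, 64, 64)   # 64x64 packets, as proposed for hardware
saving_factor = software / hardware
```

In this idealized model the saving depends only on the ratio of packet sizes, (64 * 64) / (2 * 2) = 1024, not on the number of nodes visited.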

SaarCOR Architecture
Features:
• Based on interactive software ray tracer
• Exactly the same data structures, …
• Kd-trees as acceleration structure
• Packets of rays to reduce bandwidth
• Fixed OpenGL-like shading…
• … plus shadow and reflection rays
Goals:
• Simple low-bandwidth memory interface
• Half the floating point requirements of GeForce 3
• Achieves frame rates comparable to today’s graphics cards

SaarCOR Architecture: System overview

SaarCOR Architecture: Features
• Scalable
• Fully pipelined
• Multi-threading for latency hiding
• Simple communication pattern (no routing)
• Highly asynchronous

SaarCOR – Current Status
Simulation on register-transfer level
• Core @ 533 MHz, memory 64 bit @ 133 MHz (simple SDRAM, no DDR!)
• Each pipeline uses 36 FP units
• Standard SaarCOR:
• 4 pipelines
• 16 threads per pipe
• 1 GB/s bandwidth to memory (!)
• 272 KB for caches (!)
• Four pipes ~ ½ FP resources of GeForce 3

Issues
On-chip memory of standard SaarCOR
• Caches: 272 KB
• Register file (RF) for rays: 288 KB
• Register file (RF) for stack: 535 KB
Register-level simulations only
Simple shading only
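
The SaarCOR figures on the status and issues slides can be cross-checked with a few lines of arithmetic. The only assumption beyond the slide values is one transfer per memory clock, which matches the "no DDR" remark; the variable names are made up for the example.

```python
bus_width_bits = 64
memory_clock_hz = 133e6
# Peak memory bandwidth, assuming one 64-bit transfer per clock (single data rate).
peak_bandwidth_gb_s = bus_width_bits / 8 * memory_clock_hz / 1e9

pipelines, fp_units_per_pipeline, threads_per_pipeline = 4, 36, 16
total_fp_units = pipelines * fp_units_per_pipeline
total_threads = pipelines * threads_per_pipeline

on_chip_kb = {"caches": 272, "RF for rays": 288, "RF for stack": 535}
total_on_chip_kb = sum(on_chip_kb.values())
```

This gives about 1.06 GB/s of peak bandwidth, consistent with the quoted 1 GB/s, along with 144 FP units, 64 threads in flight, and 1095 KB of on-chip memory, of which the two register files make up roughly three quarters.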
