Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002
Parallel Rendering • The graphics rendering process is computationally intensive • Parallel computation is a natural way to achieve higher performance • Two levels of parallelism: • Functional parallelism – pipelining • Data parallelism – multiple results computed at the same time
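A minimal sketch of the data-parallel flavor (the functional, pipelined flavor would instead overlap stages on different primitives): the vertex list is split across threads and each thread transforms its own chunk independently. The struct, the toy transform, and the chunking are illustrative, not from the lecture.

```cpp
// A toy data-parallel pass (illustrative only): the vertex list is split
// across threads and each thread transforms its own chunk independently.
// Functional parallelism would instead chain stages (transform -> rasterize)
// so that different stages run concurrently on different primitives.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical per-vertex work: a stand-in for a real model-view transform.
static void transformChunk(std::vector<Vertex>& verts, size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i) {
        verts[i].x *= 0.5f;
        verts[i].y *= 0.5f;
        verts[i].z *= 0.5f;
    }
}

int main() {
    std::vector<Vertex> verts(1 << 20, Vertex{1.0f, 2.0f, 3.0f});
    const unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());

    std::vector<std::thread> pool;
    const size_t chunk = verts.size() / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t + 1 == numThreads) ? verts.size() : begin + chunk;
        pool.emplace_back(transformChunk, std::ref(verts), begin, end);
    }
    for (auto& th : pool) th.join();

    std::printf("transformed %zu vertices on %u threads\n", verts.size(), numThreads);
    return 0;
}
```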
Data Parallel Algorithms • Many taxonomies exist for categorizing parallel algorithms • Image space vs. object space • Shared-memory vs. distributed-memory architectures • MPI, OpenMP, … • We need a uniform framework to study and understand parallel rendering
A Core Problem • How to partition the work and distribute it across processors • Rendering requires the relevant data to travel with each work partition • How to collect the rendered results to form the final image • Intensive data communication • A landmark paper: “A Sorting Classification of Parallel Rendering”, Molnar et al., IEEE CG&A ’94
The Rendering Process • The rendering pipeline: • Geometry processing: transformation, lighting, clipping • Rasterization: scan conversion, shading, visibility • Parallel processing can be applied to either stage, or to both
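A serial, heavily simplified pass through the two stages named above, for a single flat triangle; lighting, clipping, and visibility are omitted, and the toy viewport mapping stands in for the real transformations. The triangle and screen size are made up for illustration.

```cpp
// Serial sketch of the two pipeline stages for one flat triangle.
#include <algorithm>
#include <cstdio>

struct Vec2 { float x, y; };

constexpr int WIDTH = 16, HEIGHT = 16;

// Stage 1 helper (geometry processing): map object coordinates in [-1,1]
// to a WIDTH x HEIGHT screen. A real renderer applies model-view-projection.
Vec2 toScreen(Vec2 p) {
    return { (p.x * 0.5f + 0.5f) * WIDTH, (p.y * 0.5f + 0.5f) * HEIGHT };
}

// Signed-area test used for the inside/outside check during rasterization.
float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

int main() {
    Vec2 tri[3] = { {-0.8f, -0.8f}, {0.8f, -0.6f}, {0.0f, 0.9f} };  // object space

    // Stage 1: geometry processing (transformation).
    Vec2 s[3];
    for (int i = 0; i < 3; ++i) s[i] = toScreen(tri[i]);

    // Stage 2: rasterization (scan conversion). Every pixel is tested here;
    // a real rasterizer would restrict itself to the triangle's bounding box.
    char fb[HEIGHT][WIDTH];
    for (auto& row : fb) std::fill(row, row + WIDTH, '.');
    for (int y = 0; y < HEIGHT; ++y)
        for (int x = 0; x < WIDTH; ++x) {
            Vec2 p = { x + 0.5f, y + 0.5f };
            float w0 = edge(s[1], s[2], p), w1 = edge(s[2], s[0], p), w2 = edge(s[0], s[1], p);
            if ((w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0))
                fb[y][x] = '#';
        }

    for (int y = HEIGHT - 1; y >= 0; --y) { std::fwrite(fb[y], 1, WIDTH, stdout); std::puts(""); }
    return 0;
}
```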
A New Perspective • Rendering as a sorting process: • Sort from object coordinates to screen coordinates • Use this concept to study computational and communication costs
Sorting • Parallel geometry processing: assign a subset of primitives or objects to each processor • Parallel rasterization: assign a sub-region of the screen to each processor • The key procedure: calculating the effect of each primitive on each pixel • Rendering is therefore a ‘sort’ of primitives to the screen pixels they affect • This sort involves redistributing data among processors
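A sketch of the ‘sort’ taken literally, under the assumption that the screen is divided into fixed-size tiles: each screen-space primitive is dropped into the bucket of every tile its bounding box overlaps. In a parallel renderer each bucket would belong to a different processor, so filling the buckets is exactly the redistribution step. Screen size, tile size, and the primitive records are illustrative.

```cpp
// 'Sorting' screen-space primitives into per-tile buckets.
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int SCREEN_W = 1024, SCREEN_H = 768, TILE = 256;
constexpr int TILES_X = (SCREEN_W + TILE - 1) / TILE;
constexpr int TILES_Y = (SCREEN_H + TILE - 1) / TILE;

struct Prim { int id; float xmin, ymin, xmax, ymax; };  // screen-space bounding box

int main() {
    std::vector<Prim> prims = {
        {0, 10, 10, 100, 90},      // fits in one tile
        {1, 200, 200, 700, 500},   // spans several tiles
        {2, 900, 600, 1020, 760},  // corner of the screen
    };

    std::vector<std::vector<int>> bucket(TILES_X * TILES_Y);
    for (const Prim& p : prims) {
        int tx0 = std::max(0, (int)(p.xmin / TILE)), tx1 = std::min(TILES_X - 1, (int)(p.xmax / TILE));
        int ty0 = std::max(0, (int)(p.ymin / TILE)), ty1 = std::min(TILES_Y - 1, (int)(p.ymax / TILE));
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bucket[ty * TILES_X + tx].push_back(p.id);  // primitive 'sorted' to this tile
    }

    for (int ty = 0; ty < TILES_Y; ++ty)
        for (int tx = 0; tx < TILES_X; ++tx) {
            std::printf("tile (%d,%d):", tx, ty);
            for (int id : bucket[ty * TILES_X + tx]) std::printf(" prim %d", id);
            std::printf("\n");
        }
    return 0;
}
```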
Where does the ‘sort’ take place? • The location of this ‘sort’ determines the structure of the parallel algorithm • This ‘sort’ can take place during: • Geometry processing (sort-first) • Between geometry processing and rasterization (sort-middle) • Rasterization (sort-last) • Each different sort has distinct properties
Sort-First • Redistributing raw primitives
Sort-Middle • Redistributing screen primitives
Sort-Last • Redistributing fragments, samples, or pixels
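A sketch of the sort-last case above, assuming full-frame merging with one color+depth buffer per renderer: the buffers are combined pixel by pixel, keeping the nearer sample. The buffer contents are made up; in a real system this merge runs over a network or dedicated compositing hardware.

```cpp
// Sort-last merging of two partial frames via a per-pixel depth compare.
#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

struct Pixel { uint32_t color; float depth; };

// Merge b into a, keeping the closer sample at every pixel.
void mergeDepth(std::vector<Pixel>& a, const std::vector<Pixel>& b) {
    for (size_t i = 0; i < a.size(); ++i)
        if (b[i].depth < a[i].depth) a[i] = b[i];
}

int main() {
    const float farDepth = std::numeric_limits<float>::max();  // background depth

    // Two tiny partial frames (4 pixels) from two hypothetical renderers.
    std::vector<Pixel> r0 = { {0xff0000u, 0.3f}, {0u, farDepth},    {0x00ff00u, 0.9f}, {0u, farDepth} };
    std::vector<Pixel> r1 = { {0x0000ffu, 0.7f}, {0xffff00u, 0.4f}, {0x00ffffu, 0.2f}, {0u, farDepth} };

    mergeDepth(r0, r1);  // r0 now holds the composited image
    for (size_t i = 0; i < r0.size(); ++i)
        std::printf("pixel %zu: color 0x%06x depth %g\n", i, (unsigned)r0[i].color, r0[i].depth);
    return 0;
}
```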
Processing and Communication Model • A refined model • Assume a dataset containing n_r raw primitives with average size a_r • Primitives that result from tessellation are called display primitives; if T is the tessellation ratio, there are n_d = T·n_r of these, with average size a_d = a_r/T. If there is no tessellation, T = 1, n_d = n_r, and a_d = a_r • Assume an image containing A pixels, with S samples computed per pixel • Assume that all primitives fall within the viewing frustum
Analysis of Sort-First • c: the proportion of raw primitives that must be redistributed each frame • Advantages: • Low communication requirements when the tessellation ratio and the degree of oversampling are high, or when frame-to-frame coherence can be exploited • Processors implement the entire rendering pipeline for their portion of the screen • Disadvantages: • Susceptible to load imbalance: primitives may clump into regions, concentrating the work on a few renderers • To exploit frame-to-frame coherence, retained-mode and complex data-handling code are necessary • Cost over uniprocessor rendering
Analysis of Sort-Middle • Advantages: general and straightforward; redistribution occurs at a natural place in the pipeline • Disadvantages: • High communication costs if the tessellation ratio is high • Susceptible to load imbalance between rasterizers when primitives are distributed unevenly over the screen • Cost over uniprocessor rendering
Analysis of Sort-Last • Sparse merging – merge only the small region each processor renders • Full-frame merging – always merge the whole framebuffer • Advantages: • Renderers implement the full rendering pipeline and are independent until pixel merging • Less prone to load imbalance • Full-frame (SL-full) merging can be embedded in a linear network, making it linearly scalable • Disadvantage: pixel traffic may be extremely high, particularly when oversampling • Cost over uniprocessor rendering
A Comparison • Sort-first, sort-middle and sort-last • No single category is strictly the best • Different categories can also be combined in an implementation
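A back-of-the-envelope sketch of how the model parameters drive redistribution traffic in each category. The formulas below are illustrative simplifications consistent with the slides, not the exact overhead equations from the Molnar et al. paper; every constant is an assumption, and a_r is interpreted here as the screen-space coverage of a raw primitive in pixels.

```cpp
// Rough per-frame redistribution traffic under assumed model parameters.
#include <cstdio>

int main() {
    const double n_r = 100000;        // raw primitives
    const double a_r = 50;            // assumed average coverage per raw primitive (pixels)
    const double T   = 4;             // tessellation ratio
    const double A   = 1280.0 * 1024; // image size in pixels
    const double S   = 4;             // samples per pixel
    const double N   = 16;            // processors
    const double c   = 0.25;          // fraction of raw primitives redistributed (sort-first)

    const double n_d = T * n_r;       // display primitives
    const double a_d = a_r / T;       // average coverage per display primitive

    const double bytesPerPrim   = 100;  // assumed size of one primitive record
    const double bytesPerSample = 8;    // assumed color + depth per sample

    const double sortFirst      = c * n_r * bytesPerPrim;          // raw primitives moved
    const double sortMiddle     = n_d * bytesPerPrim;               // display primitives moved
    const double sortLastSparse = n_d * a_d * S * bytesPerSample;   // only samples actually rendered
    const double sortLastFull   = N * A * S * bytesPerSample;       // every renderer ships a full frame

    std::printf("sort-first       ~ %6.1f MB/frame\n", sortFirst / 1e6);
    std::printf("sort-middle      ~ %6.1f MB/frame\n", sortMiddle / 1e6);
    std::printf("sort-last sparse ~ %6.1f MB/frame\n", sortLastSparse / 1e6);
    std::printf("sort-last full   ~ %6.1f MB/frame\n", sortLastFull / 1e6);
    return 0;
}
```

Note how the sort-middle estimate grows with the tessellation ratio and the sort-last estimates grow with oversampling, matching the trade-offs listed on the analysis slides.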
Parallel Volume Rendering • Many algorithms distribute the data to processing nodes; each node renders its portion of the data and generates a partial image, and the partial images are then composited into the final image • Screen-space partitioning (tiles or contiguous scan-lines) is also used
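A sketch of accumulating the partial images, assuming the data partitions (bricks) can be depth-sorted with respect to the viewer so the per-node RGBA images can be combined back to front with the ‘over’ operator; the pixel values are made up.

```cpp
// Compositing per-node partial images with the back-to-front 'over' operator.
#include <cstdio>
#include <vector>

struct RGBA { float r, g, b, a; };

// front 'over' back, per pixel, with premultiplied alpha.
RGBA over(RGBA front, RGBA back) {
    return { front.r + (1 - front.a) * back.r,
             front.g + (1 - front.a) * back.g,
             front.b + (1 - front.a) * back.b,
             front.a + (1 - front.a) * back.a };
}

int main() {
    // Two 2-pixel partial images; node 0's brick is nearer to the viewer.
    std::vector<RGBA> nearImg = { {0.2f, 0.0f, 0.0f, 0.2f}, {0.0f, 0.0f, 0.0f, 0.0f} };
    std::vector<RGBA> farImg  = { {0.0f, 0.5f, 0.0f, 0.5f}, {0.0f, 0.0f, 0.7f, 0.7f} };

    for (size_t i = 0; i < nearImg.size(); ++i) {
        RGBA c = over(nearImg[i], farImg[i]);
        std::printf("pixel %zu: rgba = %.2f %.2f %.2f %.2f\n", i, c.r, c.g, c.b, c.a);
    }
    return 0;
}
```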
Load Balancing • For better load balancing: • Task queuing: the task queue can be ordered by decreasing task size, so that the granularity becomes finer as the queue is drained • Load stealing: nodes steal smaller tasks from other nodes once they have completed their own • Time stamps: each task carries a timeout; if a node cannot finish its task before the timeout, it takes the remnant of the task, re-partitions it, and re-distributes it • Hierarchical data structures, such as octrees and k-d trees, are commonly used
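A sketch of the decreasing-size task queue from the first bullet: big tasks go out first, so the remaining work becomes finer-grained toward the end and the workers finish at roughly the same time. Task costs, the worker count, and the sleep standing in for rendering are all illustrative; work stealing and timeouts would be layered on the same shared queue.

```cpp
// A shared task queue that hands out the largest remaining task first.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Task { int id; int cost; };  // cost ~ estimated work (e.g. primitive count)

std::vector<Task> taskQueue;  // kept sorted by ascending cost; back = largest remaining
std::mutex queueMutex;

bool take(Task& out) {
    std::lock_guard<std::mutex> lock(queueMutex);
    if (taskQueue.empty()) return false;
    out = taskQueue.back();   // hand out the largest remaining task
    taskQueue.pop_back();
    return true;
}

void worker(int wid) {
    Task t;
    while (take(t)) {
        // Stand-in for rendering the task; time is proportional to its cost.
        std::this_thread::sleep_for(std::chrono::milliseconds(t.cost));
        std::printf("worker %d finished task %d (cost %d)\n", wid, t.id, t.cost);
    }
}

int main() {
    for (int i = 0; i < 12; ++i) taskQueue.push_back({i, (i * 37) % 20 + 1});
    std::sort(taskQueue.begin(), taskQueue.end(),
              [](const Task& a, const Task& b) { return a.cost < b.cost; });

    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w) workers.emplace_back(worker, w);
    for (auto& th : workers) th.join();
    return 0;
}
```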
Sum It Up • Parallelism is just a tool • The payoff depends on which algorithm you parallelize • What about an OpenGL application that uses occlusion culling?