Graphics on GRAMPS
Jeremy Sugerman, Kayvon Fatahalian
Background
• Context: Broader research investigation generalizing GPU/Cell/"compute" cores and combining them with CPUs.
• Fundamental Beliefs:
  • Real data parallel apps still have performance-critical non-data-parallel pieces
  • Existing parallel programming models are too constrained (GPUs) or too hard/vague (CPUs)
  • Queues are an excellent idiom to capture producer-consumer parallelism, both thread and data
  • Fixed function execution units are not a problem, but fixed control paths are
Compute Cores
• CPUs are designed for single threads per core
  • Minimal FLOPS per core
• Compute cores are designed for lots of math per core
  • Many "threads" per core
  • Sometimes wider SIMD per thread
  • SIMD width × # hardware threads ops / core
• And more compute cores than CPU cores fit per chip
• Many examples: GPU, Cell, Niagara, Larrabee
Simplified Direct3D Pipeline
• Application launches some drawing…
• Vertex Assembly (Fixed, Non-Data Parallel)
• Vertex Processing (Programmable, Data Parallel)
• Primitive Assembly (Fixed, Non-Data Parallel)
• Primitive Processing (Programmable, Data Parallel)
• Fragment Assembly (Fixed, Non-Data Parallel)
• Fragment Processing (Programmable, Data Parallel)
• Pixel / Image Assembly (Fixed, Non-Data Parallel)
• Only Data Parallel stages are programmable!
Direct3D Pipeline Properties
• There is a reason only data parallel stages are programmable.
• 'Shader' stages are inherently per-element (e.g. vertex / primitive / fragment) and stateless between elements.
• 'Assembly' stages also run on many elements, but they have inter-element dependencies
  • State can be remembered (vertex caching)
  • Inputs can be used by multiple outputs (strips)
• Programmable 'Assembly' requires heavier (more serial) threads than 'Shaders'.
Question
• Can fixed-function control be decoupled from efficient graphics performance on a compute-heavy architecture?
• Does not necessarily exclude fixed-function execution blocks (e.g. rasterizer, texture units…)
This Talk
• GRAMPS: Our current model for programming compute cores.
• Implementing Direct3D 10 "in software" with GRAMPS.
• (Potentially) thoughts about how REYES and ray tracers map to GRAMPS.
• No explicit discussion of heterogeneous cores.
• No fancy scheduling algorithms (yet?)
Example: Simple 3D Pipeline
[Diagram: Input Vertices → Vertex Shading → Transformed Vertices → Primitive Assembly → Primitives → Rasterize (Assemble) → Fragments → Fragment Shading → Shaded Fragments → Image Assembly → Framebuffer Pixels]
GRAMPS
• General Runtime/Architecture for Multicore Parallel Systems
• Models execution graph of queues connected by threads
• Graph specified by host program
• Simulator for exploring compute cores
  • Currently conflates "hardware" and runtime
  • # of cores, thread contexts, SIMD width are all parameters
Simple GRAMPS core
• T - threads/core
• S - SIMD ALUs/core
• R - registers/thread
• 1 thread runs in each clock
• Threads issue vector instructions (think S-wide SSE)
[Diagram: one core with thread contexts Thread 0 … Thread T-1 (R registers each), ALU 0 … ALU S-1, and an L1 data cache (or scratchpad)]
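As a rough illustration of these parameters, the sketch below stores T, S, and R in a struct and derives a few figures from them. The struct and function names are hypothetical, and reading "SIMD width × # hardware threads" as the number of lane slots across all thread contexts is an assumption rather than something the slides spell out.

```c
#include <assert.h>

/* Hypothetical parameter block for one simulated core (names are illustrative). */
typedef struct {
    int threads_per_core;    /* T */
    int simd_alus_per_core;  /* S */
    int regs_per_thread;     /* R */
} core_params;

/* One thread runs per clock and issues one S-wide vector instruction,
   so at most S lane operations complete per clock. */
int peak_lane_ops_per_clock(const core_params *c) {
    assert(c->simd_alus_per_core > 0);
    return c->simd_alus_per_core;
}

/* SIMD width * number of hardware threads: lane slots across all contexts
   (an assumed reading of the "ops / core" bullet). */
int lane_slots_per_core(const core_params *c) {
    return c->simd_alus_per_core * c->threads_per_core;
}

/* Registers needed to keep every thread context resident on the core. */
int total_registers(const core_params *c) {
    return c->regs_per_thread * c->threads_per_core;
}
```

Because the simulator treats these values as parameters, sweeping T and S in such a struct is one plausible way to compare core configurations.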
D3D10 Setup
• App defines 3 shading environments
  • Vertex, geometry, fragment
  • Attach programs and resources
• App configures fixed function units
  • Fixed number of "modes"
  • Attach resources
• App submits work (vertices) to pipeline
• Graphics runtime executes until completion
GRAMPS Setup
• App defines a set of queues
• App defines a set of thread environments
• App attaches queues as thread inputs and outputs
• App bootstraps computation by inserting data into a queue
• Runtime executes threads until completion
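The setup steps above can be sketched as host code. The talk does not show a host API, so every type and function name here (`graph`, `graph_add_queue`, `graph_attach`, and so on) is hypothetical, the chunk sizes are arbitrary, and the wiring is inferred from the stage and queue names used later in the talk. The final step, the runtime executing threads, is not modeled.

```c
#include <assert.h>

/* All names are hypothetical; the talk does not give a host API. */
enum { MAX_QUEUES = 8, MAX_ENVS = 8, MAX_LINKS = 4 };
typedef enum { THREAD_SHADER, THREAD_ASSEMBLE, THREAD_FIXED } thread_type;

typedef struct { int chunk_bytes, num_chunks, pending; } queue_desc;

typedef struct {
    thread_type type;
    int inputs[MAX_LINKS];   /* ids of queues this stage reads  */
    int num_inputs;
    int outputs[MAX_LINKS];  /* ids of queues this stage writes */
    int num_outputs;
} thread_env;

typedef struct {
    queue_desc queues[MAX_QUEUES];
    int num_queues;
    thread_env envs[MAX_ENVS];
    int num_envs;
} graph;

/* Step 1: define a queue; returns its id. */
int graph_add_queue(graph *g, int chunk_bytes, int num_chunks) {
    assert(g->num_queues < MAX_QUEUES);
    queue_desc *q = &g->queues[g->num_queues];
    q->chunk_bytes = chunk_bytes;
    q->num_chunks = num_chunks;
    q->pending = 0;
    return g->num_queues++;
}

/* Step 2: define a thread environment; returns its id. */
int graph_add_env(graph *g, thread_type type) {
    assert(g->num_envs < MAX_ENVS);
    thread_env *e = &g->envs[g->num_envs];
    e->type = type;
    e->num_inputs = 0;
    e->num_outputs = 0;
    return g->num_envs++;
}

/* Step 3: attach one input queue and one output queue to an environment. */
void graph_attach(graph *g, int env, int in_queue, int out_queue) {
    thread_env *e = &g->envs[env];
    assert(e->num_inputs < MAX_LINKS && e->num_outputs < MAX_LINKS);
    assert(in_queue < g->num_queues && out_queue < g->num_queues);
    e->inputs[e->num_inputs++] = in_queue;
    e->outputs[e->num_outputs++] = out_queue;
}

/* Step 4: bootstrap by inserting chunks; fails (returns 0) if the finite queue would overflow. */
int graph_bootstrap(graph *g, int queue, int chunks) {
    queue_desc *q = &g->queues[queue];
    if (q->pending + chunks > q->num_chunks) return 0;
    q->pending += chunks;
    return 1;
}

/* Index queue -> idxVtxAssemble -> preVtxShade queue -> vtxShade -> postVtxShade queue. */
int build_vertex_front_end(graph *g) {
    g->num_queues = 0;
    g->num_envs = 0;
    int index_q  = graph_add_queue(g, 64, 16);    /* sizes are arbitrary */
    int pre_q    = graph_add_queue(g, 1024, 16);
    int post_q   = graph_add_queue(g, 1024, 16);
    int assemble = graph_add_env(g, THREAD_ASSEMBLE);
    int shade    = graph_add_env(g, THREAD_SHADER);
    graph_attach(g, assemble, index_q, pre_q);
    graph_attach(g, shade, pre_q, post_q);
    return graph_bootstrap(g, index_q, 4);
}
```

Compared with the D3D10 setup, nothing in this sketch fixes the number or order of stages: the graph is whatever the host program wires together.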
GRAMPS Entities: Execution
• Threads: Assemble, Shader, Fixed
  • Assemble: Stateful, akin to a regular thread
  • Fixed: Special purpose hardware wrapped to appear as an Assemble thread
  • Shader: Stateless and data parallel
GRAMPS Entities: Data
• Queues for producer-consumer parallelism
• Queues for aggregating coherent work
• Queues support push and reserve/commit for in-place Assembly
• Chunks are the units / granularity at which Queues are manipulated.
GRAMPS Scheduling
• GRAMPS assigns Threads to hw contexts
  • Based on graph, current Queue contents
• Tiered scheduling model
  • Tier-0: Trivially puts threads onto hw threads
  • Tier-1: Builds schedules for Tier-0.
  • Tier-N: Arbitrarily clever. Doesn't exist.
D3D10 on GRAMPS
[Diagram: the D3D10 pipeline as a GRAMPS graph. Legend: shader thread, assemble thread, fixed function in GPU.
Stages: idxVtxAssemble, vtxShade, primAssemble, primShade, rastAssemble, tri setup / clip / cull, then one lane per tri queue of rasterize, fragShade, and blend / ztest.
Queues: Index, preVtxShade, postVtxShade, prePrimAssemble, prePrimShade, postPrimShade, preRast, tri queue 0 … tri queue N, and per-lane preFragShade and postFragShade.]
Internal Queues
• Queues are just memory + a state struct (see below)
• For now: Queues are finite
• Queues are a contiguous array of chunks
• Chunks = granularity of manipulation
  typedef struct queue {
      BYTE *ptr;               /* num_chunks * chunk_byte_width bytes of chunk storage */
      int   num_chunks;
      int   chunk_byte_width;
      int   head;
      int   tail;
      int   reclaim;
      bool *done;              /* num_chunks flags, one per chunk */
  } queue;
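The slide lists the queue fields without saying how they interact, so the following is only one plausible reading, sketched with fixed sizes. It assumes `tail` is the producer's reserve cursor, `head` is the consumer's cursor, `reclaim` marks the oldest chunk not yet returned to producers, and `done[]` records which reserved chunks have been committed. The `cq_*` names and the sizes are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { NUM_CHUNKS = 4, CHUNK_BYTES = 64 };  /* arbitrary sizes for the sketch */

typedef struct {
    unsigned char data[NUM_CHUNKS * CHUNK_BYTES];
    int head, tail, reclaim;   /* ever-increasing chunk counters, wrapped on access */
    bool done[NUM_CHUNKS];     /* true once a reserved chunk has been committed */
} chunk_queue;

void cq_init(chunk_queue *q) { memset(q, 0, sizeof *q); }

/* Address of a chunk's storage inside the contiguous array. */
unsigned char *cq_chunk(chunk_queue *q, int idx) {
    return q->data + (size_t)(idx % NUM_CHUNKS) * CHUNK_BYTES;
}

/* Producer: claim the next free chunk, or -1 if the finite queue is full. */
int cq_reserve(chunk_queue *q) {
    if (q->tail - q->reclaim == NUM_CHUNKS) return -1;
    q->done[q->tail % NUM_CHUNKS] = false;
    return q->tail++;
}

/* Producer: mark a reserved chunk as fully written; commits may arrive out of order. */
void cq_commit(chunk_queue *q, int idx) {
    assert(idx >= q->head && idx < q->tail);
    q->done[idx % NUM_CHUNKS] = true;
}

/* Consumer: take the oldest chunk, but only once it has been committed. */
int cq_consume(chunk_queue *q) {
    if (q->head == q->tail || !q->done[q->head % NUM_CHUNKS]) return -1;
    return q->head++;
}

/* Consumer: hand the oldest consumed chunk back to producers. */
void cq_release(chunk_queue *q) {
    assert(q->reclaim < q->head);
    q->reclaim++;
}
```

Under this reading, per-chunk `done` flags let many shader threads finish in any order while consumers still see chunks in queue order.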
Ex: GRAMPS has chunks
[Diagram: Index queue, idxVtxAssemble, preVtxShade queue, vtxShade, postVtxShade queue]
• index_queue chunks contain vertex indices
• preVtxShade_queue chunks contain 16 pre-transformed vertices
• postVtxShade_queue chunks contain 16 transformed vertices
Ex: GRAMPS has chunks
[Diagram: rasterize, preFragShade queue, fragShade]
• preFragShade_queue chunks contain:
  • Interpolated inputs for 16 fragments
  • Liveness mask per fragment
  • x,y position per quad
  • Uniform data shared across all fragments
Queue API
• Window = view into a contiguous range of chunks for assemble threads
• Symmetric for producing/consuming access
  qwin {
      BYTE* ptr;
      int num;
      int id;
  };
• Shader threads just have "push"
Queue manipulation
• All threads:
    void produce()    "push"
• Assemble threads only:
    qwin* reserve(qwin* q, int num_chunks)
    qwin* commit(qwin* q, int num_chunks)
Internal threads
• Defines a "type" of thread
  ThreadEnv {
      type = {shader, assemble, fixed-func}
      Program Code
      uniforms/constant data
      sampler/texture/resource id bindings
      List of input queues
      List of output queues
  };
Shader threads
• Shading language unchanged (HLSL)
• Still write shaders in terms of single elements
• Compilation produces code to operate on chunks
  void hlsl_likefn(const element* inputEl, element* outputEl, const sampler foo, const tex3d tex)
Internal shader threads
• Shader thread code processes chunks
• Input:
  • GRAMPS pre-reserved chunks from in/out queues
  • Environment info (uniforms, consts, etc.)
  void shaderFn(const chunk* in_chunks[], chunk* out_chunks[], const env* env)
• Dispatched shader threads run to completion
• Completion implies:
  • in_chunks are released
  • out_chunks are committed
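To make the element-versus-chunk distinction concrete, here is a minimal sketch of a per-element function plus the kind of chunk-level wrapper a compiler might generate around it. The `element`, `chunk`, and `env` layouts and the function names are assumptions for illustration; only the chunk-level signature shape and the 16-element chunk size come from the slides. A real implementation would presumably map elements to SIMD lanes instead of looping.

```c
#include <assert.h>

enum { CHUNK_ELEMS = 16 };   /* the talk's examples use 16 elements per chunk */

typedef struct { float x, y, z, w; } element;                     /* assumed element layout */
typedef struct { element elems[CHUNK_ELEMS]; int count; } chunk;  /* assumed chunk layout */
typedef struct { float m[4][4]; } env;                            /* stand-in for uniforms */

/* Per-element shader, written in terms of a single element. */
void transform_element(const element *in, element *out, const env *e) {
    const float v[4] = { in->x, in->y, in->z, in->w };
    float r[4];
    for (int i = 0; i < 4; i++) {
        r[i] = 0.0f;
        for (int j = 0; j < 4; j++)
            r[i] += e->m[i][j] * v[j];
    }
    out->x = r[0]; out->y = r[1]; out->z = r[2]; out->w = r[3];
}

/* Chunk-level entry point: apply the per-element code to every live element.
   Inputs and outputs are pre-reserved, so the function never touches a queue. */
void shader_fn(const chunk *in_chunks[], chunk *out_chunks[], const env *e) {
    const chunk *in = in_chunks[0];
    chunk *out = out_chunks[0];
    assert(in->count >= 0 && in->count <= CHUNK_ELEMS);
    for (int i = 0; i < in->count; i++)
        transform_element(&in->elems[i], &out->elems[i], e);
    out->count = in->count;
}
```

Since the wrapper only reads its input chunk and writes its output chunk, it can run to completion without blocking, which matches the stateless, data parallel description of shader threads.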
Assemble threads
• Assemble threads build chunks
• Access queue data via windows
• Commit/reserve/consume may block thread
  void assembleFn(qwin* in_win[], qwin* out_win[], const env* env)
Ex: primitive assembly
• Input chunks = 16 verts
• Output chunks = 16 prims
• Prim structure depends on type of prim
  • Points, lines, triangles, triangles w/ adjacency, etc.
• Creating prims from verts depends on topology
  • Strips or lists
  • Triangle strip: data for output chunk comes from multiple input chunks
[Diagram: prePrimAssemble queue → primAssemble → prePrimShade queue]
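The triangle strip case shows why assembly needs state across chunks. In the sketch below, a hypothetical `strip_state` remembers the last two strip vertices so that the first triangles built from a new input chunk can reuse vertices from the previous one. Vertices are represented by integer ids, the names are illustrative, and the winding-order flip for alternate strip triangles is left out for brevity.

```c
#include <assert.h>

typedef struct { int v[3]; } triangle;   /* vertex ids; real prims would carry vertex data */

/* State an assemble thread keeps between input chunks (hypothetical layout). */
typedef struct {
    int last[2];   /* the two most recent strip vertices */
    int seen;      /* vertices consumed so far in this strip */
} strip_state;

void strip_begin(strip_state *s) {
    s->last[0] = 0;
    s->last[1] = 0;
    s->seen = 0;
}

/* Consume one input chunk of strip vertices and append triangles to out[].
   Returns the number of triangles written. */
int strip_assemble(strip_state *s, const int *verts, int num_verts,
                   triangle *out, int max_out) {
    int n = 0;
    for (int i = 0; i < num_verts; i++) {
        if (s->seen >= 2) {              /* every vertex after the second closes a triangle */
            assert(n < max_out);
            out[n].v[0] = s->last[0];
            out[n].v[1] = s->last[1];
            out[n].v[2] = verts[i];
            n++;
        }
        s->last[0] = s->last[1];         /* slide the two-vertex history forward */
        s->last[1] = verts[i];
        s->seen++;
    }
    return n;
}
```

With 16-vertex input chunks, the first chunk of a strip yields 14 triangles and each later chunk yields 16, so output chunks do not line up one-to-one with input chunks.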
Ex: frag assembly (rast)
  for (each input triangle) {
      add triangle uniform data to chunk
      while (chunk not full && triangle not done) {
          rasterize next tile of quads…
          for (each nonempty quad) {
              add 4 fragments to chunk
              add quad description per chunk
          }
      }
      if (chunk is full) {
          qwin_out = commit(qwin_out, 1);
          grow window with reserve() if necessary…
      }
  }
Building chunks:
1. Compact valid quads
2. Data at various frequencies
Execution: Tier 1
[Diagram: Tier 1 scheduler over a graph of queues, shader threadEnvs, and assemble threadEnvs, connected to a core (L1 $, hardware threads T 0 … T T-1). Labeled messages: ShaderThr dispatch and AssembleThr resume via the Tier 1 to Tier 0 FIFO; Thread_Done() (implicit commit), Produce(), Reserve(), Commit().]
Execution: Tier 0
• Each cycle: round robin runnable threads
• Thread stalls: place on wait list
• When thread completes:
  • Pull next thread from FIFO, assign to empty thread slot
  • Send completion message to Tier 1
[Diagram: Tier 1 to Tier 0 FIFO feeding the Tier 0 Scheduler, which drives Thread 0 … Thread T-1 (R registers each), ALU 0 … ALU S-1, and the L1 data cache (or scratchpad)]
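The Tier 0 behavior described above is simple enough to sketch directly. This is a minimal model under stated assumptions: the slot count, FIFO capacity, and all `tier0_*` names are hypothetical, the wait list is represented as a per-slot stalled state, and the completion message back to the scheduling tier above is not modeled.

```c
#include <assert.h>

enum { NUM_SLOTS = 4, FIFO_CAP = 8 };   /* arbitrary sizes for the sketch */

typedef enum { SLOT_EMPTY, SLOT_RUNNABLE, SLOT_STALLED } slot_state;

typedef struct {
    slot_state state[NUM_SLOTS];
    int thread_id[NUM_SLOTS];
    int next;                    /* slot to try first on the next cycle */
    int fifo[FIFO_CAP];          /* thread ids queued for dispatch */
    int fifo_head, fifo_count;
} tier0;

void tier0_init(tier0 *t) {
    for (int i = 0; i < NUM_SLOTS; i++) {
        t->state[i] = SLOT_EMPTY;
        t->thread_id[i] = -1;
    }
    t->next = 0;
    t->fifo_head = 0;
    t->fifo_count = 0;
}

/* Scheduler above: enqueue a thread for dispatch. Returns 0 if the FIFO is full. */
int tier0_submit(tier0 *t, int thread_id) {
    if (t->fifo_count == FIFO_CAP) return 0;
    t->fifo[(t->fifo_head + t->fifo_count) % FIFO_CAP] = thread_id;
    t->fifo_count++;
    return 1;
}

/* Assign queued threads to empty slots, in FIFO order. */
void tier0_fill(tier0 *t) {
    for (int i = 0; i < NUM_SLOTS && t->fifo_count > 0; i++) {
        if (t->state[i] != SLOT_EMPTY) continue;
        t->thread_id[i] = t->fifo[t->fifo_head];
        t->fifo_head = (t->fifo_head + 1) % FIFO_CAP;
        t->fifo_count--;
        t->state[i] = SLOT_RUNNABLE;
    }
}

/* One cycle: pick the next runnable slot in round-robin order, or -1 if none. */
int tier0_pick(tier0 *t) {
    for (int k = 0; k < NUM_SLOTS; k++) {
        int s = (t->next + k) % NUM_SLOTS;
        if (t->state[s] == SLOT_RUNNABLE) {
            t->next = (s + 1) % NUM_SLOTS;
            return s;
        }
    }
    return -1;
}

/* A thread finished: free its slot and pull the next thread from the FIFO. */
void tier0_complete(tier0 *t, int slot) {
    assert(t->state[slot] != SLOT_EMPTY);
    t->state[slot] = SLOT_EMPTY;
    t->thread_id[slot] = -1;
    tier0_fill(t);
}
```

Keeping Tier 0 this mechanical leaves all policy, such as which threads to enqueue and in what order, to the tier that fills the FIFO.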
Validation
• "Fat enough" cores for assemble threads can deliver sufficient FLOPS
• Assemble threads can keep compute cores + fixed-function units busy
• Can give up domain-specific heuristics in the scheduling