
Hybrid PC architecture

This article investigates fine-grained CPU-GPU coupling in multi-core systems, with a focus on improving performance and ease of programming. Topics covered include trends in hybrid PC architecture, the limitations of today's coarse-grained CPU-GPU coupling, the differences between CPU and GPU cores, and the benefits of a queue-based programming model. The aim is to make GPU cores first-class execution engines in a multi-core system and to explore the potential of fine-grained interaction between cores.


Presentation Transcript


  1. Hybrid PC architecture
  Jeremy Sugerman, Kayvon Fatahalian

  2. Trends
  • Multi-core CPUs
  • Generalized GPUs
    • Brook, CTM, CUDA
  • Tighter CPU-GPU coupling
    • PS3
    • Xbox 360
    • AMD "Fusion" (faster bus, but the GPU is still treated as a batch coprocessor)

  3. CPU-GPU coupling
  • Important apps (e.g., game engines) exhibit workloads suited to both CPU- and GPU-style cores:
    • CPU friendly: IO, AI/planning, collisions, adaptive algorithms
    • GPU friendly: geometry processing, shading, physics (fluids/particles)

  4. CPU-GPU coupling
  • Current: coarse-granularity interaction (see the sketch below)
  • Control: the CPU launches a batch of work and waits for results before sending more commands (multi-pass)
    • Necessitates algorithmic changes
  • The GPU is a slave coprocessor
    • Limited mechanisms to create new work
    • The CPU must deliver LARGE batches
  • The CPU sends the GPU commands via a "driver" model
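To make the coarse-grained control flow concrete, here is a minimal C++ sketch of the driver-model interaction the slide describes. The `WorkItem`, `CommandBuffer`, and `Gpu` types and their methods are invented for illustration and stand in for a real graphics API:

```cpp
#include <vector>

// Hypothetical stand-ins for a real graphics API (names are ours).
struct WorkItem { int id; };

struct CommandBuffer {
    std::vector<WorkItem> cmds;
    void append(const WorkItem& w) { cmds.push_back(w); }
};

struct Gpu {
    void submit(const CommandBuffer&) { /* hand the batch to the driver */ }
    void waitIdle()                   { /* block until the GPU drains it */ }
};

// Multi-pass, coarse-grained control: the CPU assembles a LARGE batch,
// submits it, then stalls before any data-dependent follow-up work.
void renderFrame(Gpu& gpu, const std::vector<std::vector<WorkItem>>& passes) {
    for (const auto& pass : passes) {
        CommandBuffer cb;
        for (const WorkItem& w : pass) cb.append(w);  // batch built up-front
        gpu.submit(cb);   // GPU acts as a slave coprocessor
        gpu.waitIdle();   // coarse sync point: new work is created only by
                          // the CPU between passes, never by the GPU itself
    }
}
```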

  5. Fundamentally different cores
  • "CPU" cores
    • Small number (tens) of HW threads
    • Software (OS) thread scheduling
    • Memory system prioritizes minimizing latency
  • "GPU" cores
    • Many HW threads (>1000), hardware scheduled
    • Minimize per-thread state (state kept on-chip)
      • Shared PC, wide SIMD execution, small register file
      • No thread stack
    • Memory system prioritizes throughput
  • Not clear: sync, SW-managed memory, isolation, resource constraints

  6. GPU as a giant scheduler
  [Pipeline diagram: a command buffer feeds IA -> VS -> GS -> RS -> PS -> OM, with the stages linked by on-chip queues, data held in off-chip buffers, and an output stream from GS. Per-stage amplification: VS 1-to-1, GS 1-to-N (bounded), RS 1-to-N (unbounded), PS 1-to-(0 or X) (X static).]

  7. GPU as a giant scheduler
  [Microarchitecture diagram: a hardware scheduler with a thread scoreboard dispatches VS/GS/PS work onto the processing cores; IA, RS, and OM are fixed-function stages; the command, vertex, primitive, and fragment queues are kept on chip (read-modify-write), with data in off-chip buffers.]

  8. GPU as a giant scheduler
  • The rasterizer (plus the input command processor) is a domain-specific HW work scheduler
    • Millions of work items per frame
    • On-chip queues of work
    • Thousands of HW threads active at once
  • CPU threads (via API commands), GS programs, and fixed-function logic generate work
  • The pipeline describes dependencies
  • What is the work here?
    • Vertices
    • Geometric primitives
    • Fragments
    • In the future: rays?
  • Well-defined resource requirements for each category (see the sketch below)
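One way to summarize slides 6-8 in code: each stage consumes from one on-chip queue and emits into the next with a characteristic fan-out. The sketch below is our reading of the diagrams; the stage and queue names mirror the DirectX 10 pipeline, while the enum and struct are invented:

```cpp
// Per-stage work amplification, as labeled in the pipeline diagram.
enum class FanOut {
    OneToOne,         // one item in, one item out
    OneToNBounded,    // bounded amplification (declared maximum)
    OneToNUnbounded,  // a primitive may cover any number of fragments
    OneToZeroOrX      // a fragment is killed or emits X outputs (X static)
};

struct Stage {
    const char* name;    // pipeline stage
    const char* queue;   // on-chip queue it consumes from
    FanOut      fanOut;  // how much new work one input item can create
};

// IA -> VS -> GS -> RS -> PS -> OM, linked by on-chip queues.
constexpr Stage kPipeline[] = {
    {"IA", "command queue",   FanOut::OneToOne},
    {"VS", "vertex queue",    FanOut::OneToOne},
    {"GS", "primitive queue", FanOut::OneToNBounded},
    {"RS", "primitive queue", FanOut::OneToNUnbounded},
    {"PS", "fragment queue",  FanOut::OneToZeroOrX},
    {"OM", "fragment queue",  FanOut::OneToOne},
};
```

The hardware scheduler can exploit exactly this information: because each category of work has a known fan-out and known resource needs, it can bound on-chip queue growth before dispatching a thread.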

  9. The project
  • Investigate making "GPU" cores first-class execution engines in a multi-core system
  • Add:
    • Fine-granularity interaction between cores
    • Work processed on any core can create new work (for any other core)
  • Hypothesis: scheduling work (actions) is the key problem
    • Keeping state on-chip
  • Drive architecture simulation with an interactive graphics pipeline augmented with raytracing

  10. Our architecture
  • Multi-core processor = some "CPU"-style + some "GPU"-style cores
  • Unified system address space
  • "Good" interconnect between cores
  • Actions (work) on any core can create new work
  • Potentially:
    • Software-managed configurable L2
    • Synchronization/signaling primitives across actions

  11. Need new scheduler
  • The GPU HW scheduler leverages highly domain-specific information
    • Knows dependencies
    • Knows the resources used by threads
  • Need to move to a more general-purpose HW/SW scheduler, yet still do okay
  • Questions (one candidate policy is sketched below):
    • What scheduling algorithms?
    • What information is needed to make decisions?
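The slide leaves both questions open; as one illustrative answer (entirely our own, not the authors' design), here is a naive policy that uses just three pieces of per-queue information, occupancy, resource footprint, and dependency readiness, and drains the fullest runnable queue so on-chip state stays bounded:

```cpp
#include <cstddef>
#include <vector>

// Minimal per-queue metadata a general-purpose scheduler might track.
struct QueueInfo {
    std::size_t depth;          // items waiting (proxy for on-chip state)
    std::size_t regsPerThread;  // resource footprint of this queue's kernel
    bool        inputsReady;    // dependency info: upstream data available
};

// Pick the deepest queue whose kernel fits in the remaining resources.
// Draining full queues first keeps the total queued state small.
int pickQueue(const std::vector<QueueInfo>& queues, std::size_t freeRegs) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(queues.size()); ++i) {
        const QueueInfo& q = queues[i];
        if (!q.inputsReady || q.depth == 0 || q.regsPerThread > freeRegs)
            continue;
        if (best < 0 || q.depth > queues[best].depth)
            best = i;
    }
    return best;  // -1 means nothing is runnable right now
}
```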

  12. Programming model = queues
  • Model the system as a collection of work queues
  • Create work = enqueue
  • SW-driven dispatch of "CPU" core work
  • HW-driven dispatch of "GPU" core work
  • Application code does not dequeue (see the sketch below)
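A minimal sketch of the enqueue-only model, with all names invented by us: application code only creates work, and a runtime dispatcher (software here, standing in for hardware dispatch on GPU cores) is the only thing that pulls from the queues:

```cpp
#include <deque>
#include <functional>

// A work item bundles data with the kernel that should process it.
struct Task {
    std::function<void()> run;
};

struct WorkQueue {
    std::deque<Task> items;
    void enqueue(Task t) { items.push_back(std::move(t)); }  // the only app-visible op
};

// Dispatch is owned by the runtime: software-driven for "CPU" queues here,
// and conceptually hardware-driven for "GPU" queues in the real design.
void dispatchAll(WorkQueue& q) {
    while (!q.items.empty()) {
        Task t = std::move(q.items.front());
        q.items.pop_front();   // application code never does this
        t.run();               // running a task may enqueue more work
    }
}
```

A task's `run` closure can itself call `enqueue`, which is exactly how work processed on any core creates new work for any other core.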

  13. Benefits of queues
  • Describe classes of work
  • Associate queues with environments:
    • GPU (no gather)
    • GPU + gather
    • GPU + create work (bounded)
    • CPU
    • CPU + SW-managed L2
  • Opportunity to coalesce/reorder work
    • Fine-grained creation, bulk execution (sketched below)
  • Describe dependencies
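To illustrate two of these benefits together, the sketch below (names are ours) tags each queue with an execution environment, so a scheduler knows which cores can run its work, and shows fine-grained creation coalesced into bulk execution: items are enqueued one at a time but handed to a core as a batch:

```cpp
#include <vector>

// Execution environments from the slide: a queue's work may only be
// placed on cores that support its environment.
enum class Env { GpuNoGather, GpuGather, GpuCreateWorkBounded, Cpu, CpuSwL2 };

struct Item { int id; };

struct TypedQueue {
    Env               env;       // class of work this queue holds
    std::vector<Item> pending;

    void enqueue(Item i) { pending.push_back(i); }  // fine-grained creation

    // Bulk execution: hand the whole batch to a core at once, so dispatch
    // overhead is paid once per batch rather than once per item.
    template <typename Kernel>
    void dispatchBatch(Kernel kernel) {
        kernel(pending);   // e.g. launch pending.size() GPU threads together
        pending.clear();
    }
};
```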

  14. Decisions
  • Granularity of work
    • Enqueue elements or batches?
    • "Coherence" of work (batching state changes)
  • Associate kernels/resources with queues (part of the environment)?
  • Constraints on enqueue
    • Fail gracefully in case of a work explosion (see the sketch below)
  • Scheduling policy
    • Minimize state (size of queues)
    • How to understand dependencies
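One way to make "fail gracefully" concrete: a capacity-bounded enqueue (our sketch; the capacity and the fallback behavior are illustrative) that reports back-pressure instead of overflowing finite on-chip storage, leaving the producer to throttle or spill:

```cpp
#include <cstddef>
#include <deque>

struct Item { int id; };

// Bounded queue: the cap models finite on-chip queue storage.
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t cap) : cap_(cap) {}

    // Returns false instead of growing without bound when producers
    // "explode" (e.g. a stage amplifying far more than expected).
    bool tryEnqueue(Item i) {
        if (items_.size() >= cap_) return false;  // caller throttles or spills
        items_.push_back(i);
        return true;
    }

private:
    std::size_t      cap_;
    std::deque<Item> items_;
};

// Producer side: on failure, a real system might stall this thread, spill
// the item to an off-chip buffer, or preferentially schedule consumers.
void produce(BoundedQueue& q, Item i) {
    while (!q.tryEnqueue(i)) {
        /* back off: yield so the scheduler can drain the queue */
    }
}
```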

  15. First steps
  • Coarse architecture simulation
    • Hello world = run CPU + GPU threads, with GPU threads creating other threads (sketched below)
  • Identify GPU ISA additions
  • Establish what information the scheduler needs
    • What are the "environments"?
  • Eventually drive the simulation with a hybrid renderer
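A toy version of that hello world, simulated on the host with std::thread standing in for both core types (all structure here is ours): a "CPU" thread seeds the queue, "GPU" workers process items, and each processed item may create new work without a CPU round trip, which is the behavior the real simulation needs to demonstrate:

```cpp
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Work { int depth; };

std::deque<Work> queue;   // shared work queue ("on-chip" in the real design)
std::mutex       m;

void enqueue(Work w) { std::lock_guard<std::mutex> g(m); queue.push_back(w); }

// A "GPU" worker: drains the queue, and each item may create children,
// i.e. GPU threads creating other threads.
void gpuWorker() {
    for (;;) {
        Work w;
        {
            std::lock_guard<std::mutex> g(m);
            if (queue.empty()) return;  // toy termination: an idle worker may
                                        // exit early; the rest drain the work
            w = queue.front(); queue.pop_front();
        }
        std::printf("processed item at depth %d\n", w.depth);
        if (w.depth < 2)                // bounded amplification: 2 children
            for (int i = 0; i < 2; ++i) enqueue({w.depth + 1});
    }
}

int main() {
    enqueue({0});                       // the "CPU" thread seeds the system
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(gpuWorker);
    for (auto& t : workers) t.join();
}
```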

  16. Evaluation
  • Compare against architectural alternatives:
    • Multi-pass rendering (very coarse-grained) with a domain-specific scheduler
    • Paper: "GPU" microarchitecture comparison with our design
      • Scheduling resources
      • On-chip state / performance tradeoff
      • On-chip bandwidth
    • Many-core homogeneous CPU

  17. Summary
  • Hypothesis: elevating "GPU" cores to first-class execution engines is a better way to build a hybrid system
    • Apps with dynamic/irregular components
    • Performance
    • Ease of programming
  • Allow all cores to generate new work by adding to system queues
  • Scheduling the work in these queues is the key issue (goal: keep queues on chip)

  18. Three fronts
  • GPU micro-architecture
    • GPU work creating GPU work
    • Generalization of the DirectX 10 GS
  • CPU-GPU integration
    • GPU cores as first-class execution environments (dump the driver model)
    • Unified view of work throughout the machine
    • Any core creates work for other cores
  • GPU resource management
    • Ability to correctly manage/virtualize GPU resources
    • Window manager
