The Stanford Pervasive Parallelism Lab A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M. Rosenblum, S. Thrun Pervasive Parallelism Laboratory Stanford University
The Looming Crisis • Software developers will soon face systems with • > 1 TFLOP of compute power • 20+ cores, 100+ hardware threads • Heterogeneous cores (CPUs + GPUs), app-specific accelerators • Deep memory hierarchies • Challenge: harness these devices productively • Improve performance, power, reliability, and security • The parallelism gap • A yawning divide between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
The Stanford Pervasive Parallelism Laboratory • Goal: the parallel computing platform for 2012 • Make parallel programming practical for the masses • Algorithms, programming models, runtimes, and architectures for scalable parallelism (10,000s of threads) • Make parallel computing a core component of CS education • PPL is a combination of • Leading Stanford researchers across multiple domains • Applications, languages, software systems, architecture • Leading companies in computer systems and software • Sun, AMD, Nvidia, IBM, Intel, HP • An exciting vision for pervasive parallelism • Open laboratory; all results released as open source
The PPL Team • Applications • Ron Fedkiw, Vladlen Koltun, Sebastian Thrun • Programming & software systems • Alex Aiken, Pat Hanrahan, Mendel Rosenblum • Architecture • Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (Director)
The PPL Team • Research expertise • Applications: graphics, physics simulation, visualization, AI, robotics, … • Software systems: virtual machines, GPGPU, stream programming, transactional programming, speculative parallelization, optimizing compilers, bug detection, security,… • Architecture: multi-core & multithreading, scalable shared-memory, transactional memory hardware, interconnect networks, low-power processors, stream processors, vector processors, … • Commercial success • MIPS & SGI, Rambus, VMware, Niagara processors, Renderman, Stream Processors Inc, Avici, Tableau, …
Guiding Principles • Top-down research: app & developer needs drive the system; high-level info flows to the low-level system • Scalability: in hardware resources (10,000s of threads) and in developer productivity (ease of use) • HW provides flexible primitives; software synthesizes complete solutions • Build real, full-system prototypes
The PPL Vision [Diagram: applications (Virtual Worlds, Autonomous Vehicle, Financial Services) built on domain-specific languages (Rendering, Physics, Scripting, Probabilistic, and Analytics DSLs); a Parallel Object Language and Common Parallel Runtime combining explicit/static and implicit/dynamic management; a Hardware Architecture with SIMD, OOO, and threaded cores, isolation & atomicity, scalable interconnects, partitionable hierarchies, scalable coherence, and pervasive monitoring]
Demanding Applications • Leverage domain expertise at Stanford • CS research groups & national centers for scientific computing • From consumer apps to neuroinformatics [Diagram: PPL at the hub of existing Stanford research centers (Media-X, DOE ASC, NIH NCBC) and existing Stanford CS research groups (environmental science, seismic modeling, geophysics, AI/ML, vision, graphics, games, mobile HCI, web & mining, streaming DB)]
Virtual Worlds Application • Next-gen web platform • Immersive collaboration • Social gaming • Millions of players in vast landscape • Challenges • Client-side game engine • Server-side world simulation • AI, physics, large-scale rendering • Dynamic content, huge datasets • More at http://vw.stanford.edu/
Autonomous Vehicle Application • Cars that drive autonomously in traffic • Save lives & money • Improve highway throughput • Improve productivity • Challenges • Client-side sensing, perception, planning, & control • Server-side data merging, pre-processing, & post-processing, traffic control, model generation • Real-time, huge datasets • More at http://www.stanfordracing.org
The PPL Vision (recap)
Domain-Specific Languages (DSLs) • Leverage the success of DSLs across application domains • SQL (data manipulation), Matlab (scientific), Ruby/Rails (web), … • DSLs give higher productivity for developers • High-level data types & ops tailored to the domain • E.g., relations, triangles, matrices, … • Express high-level intent without specific implementation artifacts • Programmer isolated from details of the specific system • DSLs give scalable parallelism for the system • Declarative description of parallelism & locality patterns • E.g., ops on relation elements, sub-array being processed, … • Portable and scalable specification of parallelism • Automatically adjust data structures, mapping, and scheduling as systems scale up
DSL Research & Challenges • Goal: create the tools for DSL development • Initial DSL targets • Rendering, physics simulation, analytics, probabilistic computations • Challenges • DSL implementation: embed in a base PL • Start with Scala (OO, type-safe, functional, extensible) • Use Scala as a scripting DSL that ties multiple DSLs together • DSL-specific optimizations: telescoping compilers • Use domain knowledge to optimize & annotate code • Feedback to programmers? • …
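As a toy illustration of embedding a DSL in a host language like Scala (the deck's chosen base), the sketch below defines a minimal matrix type whose elementwise operations express high-level intent without fixing an implementation; a runtime would be free to parallelize them. All names here are hypothetical, not PPL's actual API.

```scala
// Hypothetical sketch: a tiny matrix DSL embedded in Scala.
// Elementwise ops declare *what* to compute; the implementation
// could be swapped for a parallel one without changing user code.
case class Matrix(rows: Int, cols: Int, data: Vector[Double]) {
  private def zip(that: Matrix)(f: (Double, Double) => Double): Matrix = {
    require(rows == that.rows && cols == that.cols)
    Matrix(rows, cols, data.zip(that.data).map { case (a, b) => f(a, b) })
  }
  def +(that: Matrix): Matrix = zip(that)(_ + _)  // trivially parallel
  def scale(s: Double): Matrix = Matrix(rows, cols, data.map(_ * s))
}

object Matrix {
  def fill(rows: Int, cols: Int)(v: Double): Matrix =
    Matrix(rows, cols, Vector.fill(rows * cols)(v))
}
```

Because the operations are declarative, the same user program could run on a sequential collection today and a data-parallel backend tomorrow, which is the portability argument made above.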
The PPL Vision (recap)
Common Parallel Runtime (CPR) • Goals • Provide common, portable, abstract target for all DSLs • Write once, run everywhere model • Manages parallelism & locality • Achieve efficient execution (performance, power, …) • Handles specifics of HW system • Approach • Compile DSLs to common IR • Base language + low-level constructs & pragmas • Forall, async/join, atomic, barrier, … • Per-object capabilities • Read-only or write-only, output data, private, relaxed coherence, … • Combine static compilation + dynamic management • Explicit management of regular tasks & predictable patterns • Implicit management of irregular parallelism
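A hedged sketch of how two of the low-level constructs named above (forall and async/join) might look as a library target. This uses plain Scala futures and is not the actual CPR IR; the `Cpr` object and its signatures are illustrative assumptions.

```scala
// Hypothetical sketch of CPR-style constructs (forall, async/join)
// built on Scala's standard futures; not the actual CPR IR.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object Cpr {
  // forall: a data-parallel loop; iterations may run concurrently,
  // and the call returns only after all of them complete (implicit join).
  def forall(n: Int)(body: Int => Unit): Unit = {
    val tasks = (0 until n).map(i => Future(body(i)))
    Await.result(Future.sequence(tasks), Duration.Inf)
  }

  // async/join: explicit task parallelism.
  def async[A](body: => A): Future[A] = Future(body)
  def join[A](task: Future[A]): A = Await.result(task, Duration.Inf)
}
```

A static compiler could map `forall` over a regular array directly onto SIMD or threaded cores, while irregular `async` task graphs would fall to the dynamic side of the runtime, matching the static/dynamic split described above.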
CPR Research & Challenges • Integrating & balancing opposing approaches • Task-level & data-level parallelism • Static & dynamic concurrency management • Explicit & implicit memory management • Utilize high-level information from DSLs • The key to overcoming difficult challenges • Adapt to changes in application behavior, OS decisions, runtime constraints • Manage heterogeneous HW resources • Utilize novel HW primitives • To reduce overhead of communication, synchronization, … • To understand runtime behavior on specific HW & adapt to it
The PPL Vision (recap)
Hardware Architecture @ 2012 • The many-core chip: 100s of cores (OOO, threaded, & SIMD), a hierarchy of shared memories, a scalable on-chip network • The system: a few many-core chips, per-chip DRAM channels, a global address space • The data-center: a cluster of systems [Diagram: heterogeneous tiles of OOO, threaded (TC), and SIMD cores with per-core L1 caches, shared L2/L3 memories, DRAM controllers, and I/O]
Architecture Challenges • Heterogeneity: balance of resources, granularity of parallelism • Support for parallelism & locality management: synchronization, communication, …; explicit vs. implicit locality management; runtime monitoring • Scalability: on-chip/off-chip bandwidth & latency; scalability of key abstractions (e.g., coherence) • Beyond performance: power, fault tolerance, QoS, security, virtualization
Architecture Research • Revisit architecture & micro-architecture for parallelism • Define semantics & implementation of key primitives: communication, atomicity, isolation, partitioning, coherence, consistency, checkpoint; fine-grain & bulk support • Software synthesizes primitives into execution systems • Streaming system: partitioning + bulk communication • Thread-level speculation: isolation + fine-grain communication • Transactional memory: atomicity + isolation + consistency • Security: partitioning + isolation • Fault tolerance: isolation + checkpoint + bulk communication • Challenges: interactions, scalability, cost, virtualization
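To make the "software synthesizes primitives" idea concrete, here is a deliberately trivial sketch of a transactional-memory-style interface composed from atomicity + isolation. A single global lock stands in for the real mechanism; actual TM hardware or runtimes track read/write sets and allow concurrency. The `Tm` object and its names are hypothetical.

```scala
// Hypothetical sketch: an atomic-region interface composed from
// atomicity + isolation primitives. A single global lock is the
// simplest possible implementation, not a realistic TM design.
object Tm {
  private val lock = new Object

  final class TVar[A](private var v: A) {
    def get: A = v              // call only inside atomic { ... }
    def set(x: A): Unit = v = x
  }

  // Everything in the body executes atomically and in isolation
  // with respect to all other atomic regions.
  def atomic[A](body: => A): A = lock.synchronized(body)
}
```

The point of the synthesis view is that the same `atomic` interface could later be backed by hardware primitives for isolation and consistency without changing client code.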
Architecture Research • Software-managed HW primitives: exploit high-level knowledge from DSLs & CPR • E.g., scale coherence using coarse-grain techniques: coarse-grain in time (force coherence only when needed); coarse-grain in space (object-based, selective coherence) • Support for programmability & management: fine-grain monitoring, HW-assisted invariants; build upon primitives for concurrency; efficient interface to CPR • Scalable on-chip & off-chip interconnects: high-radix networks, adaptive routing
Research Methodology • Conventional approaches are still useful • Develop apps & SW systems on existing platforms • Multi-core, accelerators, clusters, … • Simulate novel HW mechanisms • Need a methodology that bridges HW & SW research • Makes new HW features available for SW research • Does not compromise HW speed, SW features, or scale • Allows for full-system prototypes • Needed for research, convincing for industry, exciting for students • Approach: commodity chips + FPGAs in the memory system • Commodity chips: fast system with a rich SW environment • FPGAs: prototyping platform for new HW features • Scale through a cluster arrangement
FARM: Flexible Architecture Research Machine [Diagram sequence: a cluster of commodity multi-core chips with attached memories; an FPGA with SRAM and I/O inserted into the coherent memory system; optional GPU/stream accelerators attached to the FPGA; the whole arrangement scaled out over an Infiniband or PCIe interconnect]
Example FARM Uses • Software research • SW development for heterogeneous systems • Code generation & resource management • Scheduling system for large-scale parallelism • Thread state management & adaptive control • Hardware research • Scalable streaming & transactional HW • FPGA extends protocols throughout cluster • Scalable shared memory • FPGA provides coarse-grain tracking • Hybrid memory systems • Custom processors & accelerators • HW support for monitoring, scheduling, isolation, virtualization, …
Conclusions • PPL: a full system vision for pervasive parallelism • Applications, programming models, software systems, and hardware architecture • Key initial ideas • Domain-specific languages • Combine implicit & explicit management • Flexible HW features