Advanced Structure of Programming Languages Next Generation Parallel and Distributed Languages
Problem Examples • Simulations of the earth’s climate • Resolution: 10 kilometers, Period: 1 year, Ocean and biosphere models: simple • Total requirements: 10^16 floating-point operations • With a supercomputer capable of 10 Giga FLOPS, it will take about 10 days to execute • Real-time processing of 3D graphics • Number of data elements: 10^9 (1024 in each dimension) • Number of operations per element: 200 • Update rate: 30 times per second • Total requirements: 6.4 x 10^12 operations per second • With processors capable of 10 Giga IOPS, we need 640 of them
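The two estimates above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes the climate figure is a total operation count of 10^16 and that the 3D data set is a 1024 x 1024 x 1024 cube. The exact results (about 11.6 days and about 644 processors) agree with the rounded figures on the slide.

```python
# Back-of-the-envelope check of the two estimates (illustrative helper names).
SECONDS_PER_DAY = 24 * 60 * 60


def climate_days(total_flop=1e16, machine_flops=10e9):
    """Days needed to execute total_flop operations at machine_flops per second."""
    return total_flop / machine_flops / SECONDS_PER_DAY


def graphics_ops_per_second(points_per_dim=1024, ops_per_element=200,
                            updates_per_second=30):
    """Sustained operation rate for a cubic 3D data set updated in real time."""
    return points_per_dim ** 3 * ops_per_element * updates_per_second


def processors_needed(required_ops, per_processor_ops=10e9):
    """Number of processors at per_processor_ops each to sustain required_ops."""
    return required_ops / per_processor_ops


if __name__ == "__main__":
    print(f"climate run: {climate_days():.1f} days")
    rate = graphics_ops_per_second()
    print(f"3D graphics: {rate:.2e} ops/s, {processors_needed(rate):.0f} processors")
```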
Parallel Computation • Parallel computation means • multiple CPUs • single or multiple streams of instructions • executing multiple instructions at a time • Typical process • Breaking a problem into pieces and arranging for all pieces to be solved simultaneously on a multi-CPU computer system • Requirements • Parallel algorithms • only parallelizable applications can benefit from parallel implementation • Parallel languages • expressing parallelism • Parallel architectures • provide hardware support
Background: Message Passing Model • A set of cooperating sequential processes • Each with its own local address space • Processes interact through explicit transactions (send, receive, …) • Advantage • Programmer controls data and work distribution • Disadvantage • Communication overhead for small transactions • Hard to program! • Example: MPI [Figure: processes, each with its own address space, connected by a network]
Message Passing • Popular approach to parallelism • Can build up a sophisticated set of utilities based on • What processor am I? • How many processors are there? • Send & Receive (blocking or non-blocking) • MPI (Message Passing Interface) is the most popular standard at present, with many implementations, both academic and vendor
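The four primitives listed above (who am I, how many of us are there, send, receive) are enough to write a small SPMD program. The sketch below is not MPI: it simulates the model inside one Python process, with threads standing in for processes and one queue per "process" standing in for the network. `run_spmd` and `partial_sum` are hypothetical names chosen for this example.

```python
import queue
import threading


def run_spmd(size, program):
    """Launch `size` simulated processes; each runs program(rank, size, send, recv)."""
    mailboxes = [queue.Queue() for _ in range(size)]  # one inbox per "process"
    results = [None] * size  # harness only: hands return values back to the caller

    def send(dest, message):
        mailboxes[dest].put(message)

    def worker(rank):
        def recv():
            return mailboxes[rank].get()  # blocking receive

        results[rank] = program(rank, size, send, recv)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def partial_sum(rank, size, send, recv):
    """Each rank sums its share of 0..99; rank 0 collects the partial sums."""
    local = sum(range(rank, 100, size))
    if rank != 0:
        send(0, local)
        return None
    return local + sum(recv() for _ in range(size - 1))


if __name__ == "__main__":
    print(run_spmd(4, partial_sum)[0])
```

Every instance runs the same code and uses only its rank to decide which part of the work it owns. All data exchange goes through explicit `send` and `recv` calls.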
Background: Data Parallel Model • One thread (process) of execution • Different data items are manipulated in the same way by that thread • Conditional statements to exclude (or include) parts of data in an operation • Parallelism is implicit (compiler) • Advantage • Easy to write and comprehend • No synchronization • Disadvantage • No independent branching • Example: HPF (High Performance Fortran) [Figure: one process operating on different data / address spaces across a network]
Background: Shared Memory Model • Different simultaneous execution threads (processes) • Read / Write to one shared memory space and invalidate if necessary. • Advantage • Read remote memory via an expression • Write remote memory through assignment • Disadvantage • Manipulating shared data leads to synchronization requirements • Does not allow locality exploitation • Example: OpenMP [Figure: Thread 1, Thread 2 and Thread 3 accessing one shared address space (e.g. a shared variable x)]
Distributed Shared Memory Model • Similar to the shared memory paradigm • Memory Mi has affinity to Thread i. • At the same time each thread has a global view of memory. • Advantage: • Helps exploit locality of reference • Disadvantage: • Synchronization still necessary • Examples: UPC, Titanium, Co-Array, Global Arrays [Figure: Thread 1, Thread 2 and Thread 3 over memories M1, M2 and M3, forming a partitioned shared address space in which each partition has affinity to the corresponding thread]
Historical Timeline (years marked on the timeline: 1994, 1998, 1999, 2004) • MPI, OpenMP, HPF • Co-Array Fortran: developed by Robert Numrich at the Minnesota Supercomputing Institute and NASA. Adds a parallel extension to Fortran 95. • Global Arrays toolkit: from Pacific Northwest National Laboratory. Library-based interface for C, C++, Fortran, and Python. The ARMCI (one-sided communication library) version started in 1998. • UPC (Unified Parallel C): consortium of government, academia, and HPC vendors coordinated by GMU, IDA, and NSA. • Titanium: led by Professor Yelick at the University of California, Berkeley.
Distributed Systems • Distributed systems communicate over a network • Various common assumptions make them behave differently from a truly integrated parallel system • A distributed system might be spread geographically over large distances • Separately owned and operated components, so much less reliable • Typically we use a model like client-server or peer-to-peer, and our distributed programs communicate via a communications library or package (possibly implemented as “middleware”)
Parallel Languages • Parallel and distributed languages contain constructs for indicating that pieces of a program should be run concurrently (at the same time) • Some problems, while not inherently parallel, can be made to run more efficiently by implementing them on a parallel system to exploit several processors at once
Parallel Computing Terminology • Hardware • Multicomputers • tightly networked, multiple uniform computers • Multiprocessors • tightly networked, multiple uniform processors with additional memory units • Supercomputers • general purpose and high-performance, nowadays almost always parallel • Clusters • Loosely networked commodity computers
Parallel Computing Terminology • Programming • Pipelining • divide computation into stages (segments) • assign separate functional units to each stage • Data Parallelism • multiple (uniform) functional units • apply same operation simultaneously to different elements of data set • Control Parallelism • multiple (specialized) functional units • apply distinct operations to data elements concurrently
An Illustrative Example • Problem • Find all primes less than or equal to some positive integer n • Method (the sieve algorithm) • Write down all integers from 1 to n • Cross out 1, then cross out from the list all proper multiples of the primes 2, 3, 5, 7, … up to sqrt(n) • The numbers left in the list are the primes
An Illustrative Example (cont.) • Sequential Implementation • Boolean array representing the integers from 1 to n • Buffer for holding the current prime • Index for the loop iterating through the array
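A minimal sequential version with the three pieces of state named above (boolean array, current prime, loop index) might look like this. The function name `sieve` is our own choice. Index 0 of the array is unused padding so that `is_prime[i]` describes the integer `i`.

```python
import math


def sieve(n):
    """Return all primes <= n using the sieve algorithm."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)          # boolean array for the integers 0..n
    is_prime[0] = is_prime[1] = False    # 0 and 1 are not prime
    for p in range(2, math.isqrt(n) + 1):             # p is the current prime
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):   # loop index over the array
                is_prime[multiple] = False            # strike out the multiple
    return [i for i, flag in enumerate(is_prime) if flag]


if __name__ == "__main__":
    print(sieve(30))
```

Striking starts at p*p because every smaller multiple of p has a smaller prime factor and has already been crossed out.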
An Illustrative Example (cont.) • Control-Parallel Approach • Different processors strike out multiples of different primes • The boolean array and the current prime are shared; each processor has its own private copy of the loop index
An Illustrative Example (cont.) • Data-Parallel Approach • Each processor is responsible for a unique range of the integers and does all the striking in that range • Processor 1 is responsible for broadcasting its findings (the primes it discovers) to the other processors • Potential Problem • If [n/p] (the size of each processor's range) < sqrt(n), more than one processor needs to broadcast its findings
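The data-parallel decomposition can be sketched as follows. Two simplifications are assumed here and are not from the slides. First, the primes up to sqrt(n) are computed up front and handed to every block, which plays the role of the broadcast and sidesteps the potential problem noted above. Second, the "processors" are Python threads, so the sketch shows the decomposition rather than a real speedup. All function names are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def base_primes(limit):
    """Primes <= limit by trial division (the findings that get broadcast)."""
    return [k for k in range(2, limit + 1)
            if all(k % d for d in range(2, math.isqrt(k) + 1))]


def strike_block(lo, hi, primes):
    """Return the primes in [lo, hi] after striking multiples of `primes`."""
    flags = [True] * (hi - lo + 1)
    for p in primes:
        # First multiple of p inside the block, but never p itself.
        start = max(p * p, ((lo + p - 1) // p) * p)
        for m in range(start, hi + 1, p):
            flags[m - lo] = False
    return [lo + i for i, keep in enumerate(flags) if keep]


def parallel_sieve(n, workers=4):
    """Split 2..n into contiguous blocks; each worker strikes only its own block."""
    if n < 2:
        return []
    primes = base_primes(math.isqrt(n))
    size = -(-(n - 1) // workers)        # ceiling of (n - 1) / workers
    blocks = [(lo, min(lo + size - 1, n)) for lo in range(2, n + 1, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: strike_block(b[0], b[1], primes), blocks)
        return [p for part in parts for p in part]


if __name__ == "__main__":
    print(parallel_sieve(50))
```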
Parallel Language Constructs • There are several models. • There are changing fashions in the usage communities usually dictated by whatever the current vendors/products implement most efficiently. • Efficiency is the great driver - not necessarily elegance or ease of use, although this is beginning to change.
Co-Routines • Two or more interleaved routines that can be suspended and resumed. • See the even/odd example in the figure • The routines need to preserve their internal state. • Co-routining was popular in early days of parallelism - seems to have waned now.
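The figure is not reproduced in this transcript, so the following is only a guess at what an even/odd co-routine pair might look like, written with Python generators. Each routine suspends itself with `yield` and keeps its local variable `value` until it is resumed. The names `numbers` and `interleave` are illustrative.

```python
def numbers(start, limit, log):
    """Co-routine: record start, start + 2, ... and suspend after each value."""
    value = start
    while value <= limit:
        log.append(value)
        value += 2
        yield  # suspend here; `value` is preserved until the routine is resumed


def interleave(limit):
    """Resume the even and odd routines in turn until both have finished."""
    log = []
    routines = [numbers(0, limit, log), numbers(1, limit, log)]
    while routines:
        for routine in list(routines):
            try:
                next(routine)          # resume the routine where it left off
            except StopIteration:
                routines.remove(routine)
    return log


if __name__ == "__main__":
    print(interleave(5))
```

Only one routine runs at a time, so this is interleaving rather than true parallel execution.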
Parallel Statements • Still popular and quite easy to use • Slow progress on standards body agreements, although a lot of progress was made in High Performance Fortran • Two typical examples: parfor and parbegin • parbegin just executes a list of statements in parallel (coarse grained parallelism) • parfor gives the semantics of a separate parallel execution of every “iteration” instance of the for index (fine grained parallelism) • A number of languages implement syntactic variations
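Python has no `parbegin` or `parfor` statement, but their semantics can be approximated with a thread pool from the standard library. The two helper functions below are hypothetical stand-ins written for this sketch, not part of any language standard.

```python
from concurrent.futures import ThreadPoolExecutor


def parbegin(*statements):
    """Run each zero-argument callable concurrently; return once all have finished."""
    with ThreadPoolExecutor(max_workers=max(1, len(statements))) as pool:
        futures = [pool.submit(statement) for statement in statements]
        return [future.result() for future in futures]


def parfor(indices, body):
    """Run body(i) for every index as a separate task; results come back in index order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, indices))


if __name__ == "__main__":
    print(parbegin(lambda: "read input", lambda: "load assets"))
    print(parfor(range(5), lambda i: i * i))
```

`parbegin` takes a few large, different pieces of work (coarse grained). `parfor` applies one body to many index values (fine grained), which is only safe when the iterations do not depend on each other.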
Processes • By far the most common way to express parallelism • Usually quite coarse grained, i.e. not too many separate processes • Usually managed by the operating system • Processes can fork (spawn a new copy of themselves) and later join it (merge back into one) • We talk about spawning a new process • A process is quite heavyweight: separate memory segments, pages, processor context and a copy of all the management apparatus the operating system needs for a whole new user program instance • A process runs a program and can be stopped, started, killed etc.
Threads vs Processes • A thread is “lighter weight” than a process: typically a single process/program can have several concurrent threads. The process can time slice between its threads, hopefully giving them a fair slice of the available CPU resources • Typically threads have only their own program counter, registers and stack; they share memory and other processor context with the other threads in the same process/program
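Because threads share their process's memory, a variable created by one thread is directly visible to the others, and updates to it have to be synchronized. A small sketch, with illustrative names, in which `start()` plays the role of fork and `join()` the role of join:

```python
import threading


def count_with_threads(num_threads=4, increments=1000):
    """Several threads increment one shared counter; return its final value."""
    counter = {"value": 0}       # lives in the process's memory, visible to all threads
    lock = threading.Lock()      # protects the shared counter

    def work():
        for _ in range(increments):
            with lock:           # without the lock, concurrent updates could be lost
                counter["value"] += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()                # "fork": a new thread of control begins
    for t in threads:
        t.join()                 # "join": wait for it to finish
    return counter["value"]


if __name__ == "__main__":
    print(count_with_threads())
```

Separate processes would each get their own copy of `counter`, so the same program would need explicit communication to combine the counts.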
Processes and threads in languages • Black box view (T: thread) [Figure: threads T0, T1, T2, …, Tn created and merged again, shown once with FORK … JOIN and once with COBEGIN … COEND]
Concurrent and Parallel Programming Languages Classification of programming languages
Producer-consumer model for a parallel compiler The assembler code can be shared in some sort of shared array
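A toy version of this producer-consumer arrangement is sketched below. The "code generator" and "assembler" stages, the instruction format, and the function name are all invented for illustration. A bounded queue stands in for the shared array.

```python
import queue
import threading

DONE = object()  # sentinel that tells the consumer no more items will arrive


def compile_pipeline(statements):
    """Producer emits toy assembler lines; consumer 'assembles' them concurrently."""
    shared = queue.Queue(maxsize=4)   # bounded buffer standing in for the shared array
    output = []

    def producer():                   # code-generation stage
        for target, value in statements:
            shared.put(f"mov {target}, {value}")   # blocks while the buffer is full
        shared.put(DONE)

    def consumer():                   # assembler stage
        while True:
            line = shared.get()       # blocks while the buffer is empty
            if line is DONE:
                break
            output.append(line.upper().encode("ascii"))  # toy "assembly" step

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


if __name__ == "__main__":
    print(compile_pipeline([("r1", 5), ("r2", 7)]))
```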
Parallel distributed computing • Ada • uses the rendezvous concept, which combines features of RPC and monitors • PVM (Parallel Virtual Machine) • to support workstation clusters • MPI (Message-Passing Interface) • programming interface for parallel computers
Message Passing Model • MIMD program. • An instance can decide what part of the work it should do and send and receive messages to and from its peers.
Tuple Space • The idea of managing a fake shared memory space and allowing programs to put and get shared data to/from this space has been used to implement systems like the Linda shared tuple space • The Linda parallel programming language is based around this abstraction (all the underlying remote comms or message passing is hidden from the programmer)
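A minimal in-process imitation of a tuple space is sketched below, using Linda's operation names `out` (put a tuple), `in` (take a matching tuple) and `rd` (read without removing). It is an assumption-laden toy: a real Linda system distributes the space across machines, whereas this class only shares it between threads. `in_` carries a trailing underscore because `in` is a Python keyword, and `None` acts as a wildcard in patterns.

```python
import threading


class TupleSpace:
    """Toy shared tuple space for threads within one process."""

    def __init__(self):
        self._tuples = []
        self._changed = threading.Condition()

    def out(self, *tup):
        """Put a tuple into the space and wake any waiting readers."""
        with self._changed:
            self._tuples.append(tup)
            self._changed.notify_all()

    def _find(self, pattern):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                    p is None or p == v for p, v in zip(pattern, tup)):
                return tup
        return None

    def rd(self, *pattern):
        """Block until a matching tuple exists; return it without removing it."""
        with self._changed:
            while (tup := self._find(pattern)) is None:
                self._changed.wait()
            return tup

    def in_(self, *pattern):
        """Block until a matching tuple exists; remove and return it."""
        with self._changed:
            while (tup := self._find(pattern)) is None:
                self._changed.wait()
            self._tuples.remove(tup)
            return tup


if __name__ == "__main__":
    space = TupleSpace()
    space.out("x", 41)
    _, value = space.in_("x", None)   # take the tuple out, so nobody else can update it
    space.out("x", value + 1)         # put the updated value back
    print(space.rd("x", None))
```

Taking a tuple out with `in_` before updating it gives a simple form of mutual exclusion on shared data.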
Some Object Oriented syntax and apparatus might be used to make our shared integer abstraction work
Other Parallel Ideas • We can implement whatever primitives will help us in our programming language • For example, parallel Prolog might have various ways of implementing AND/OR logical operations in parallel. • Other languages might have ways of evaluating expressions in parallel • In one sense we want to hide the low level parallel ideas away from the applications programmer, although it may be difficult
Can design patterns bring order to parallel programming? • The book “Patterns for Parallel Programming” contains a design pattern language to capture how experts think about parallel programming. • It is an attempt to be to parallel programming what the GoF (“Gang of Four”) book was to object-oriented programming.
Concurrency in Parallel Software: Find Concurrency and Supporting Patterns
[Figure: design flow from the Original Problem, to Tasks, shared and local data (strategy and algorithm), to Units of execution + new shared data for extracted dependencies, to the Corresponding source code. The same SPMD program, shown below, runs on every unit of execution; N is the problem size, defined elsewhere.]
    Program SPMD_Emb_Par () {
        TYPE *tmp, *func();
        global_array Data(TYPE);
        global_array Res(TYPE);
        int Num = get_num_procs();    // number of units of execution
        int id = get_proc_id();       // this unit's ID (rank)
        if (id == 0) setup_problem(N, Data);
        for (int I = id; I < N; I = I + Num) {    // cyclic distribution of iterations
            tmp = func(I, Data);
            Res.accumulate(tmp);
        }
    }
Supporting Patterns • Fork-join • A computation begins as a single thread of control. Additional threads are created as needed (forked) to execute functions and then, when complete, terminate (join). The computation continues as a single thread until a later time when more threads might be useful. • SPMD • Multiple copies of a single program are launched, typically with their own view of the data. The path through the program is determined in part based on a unique ID (a rank). This is by far the most commonly used pattern with message passing APIs such as MPI. • Loop parallelism • Parallelism is expressed in terms of loops that execute concurrently. • Master-worker • A process or thread (the master) sets up a task queue and manages other threads (the workers) as they grab a task from the queue, carry out the computation, and then return for their next task. This continues until the master detects that a termination condition has been met, at which point the master ends the computation. • SIMD • The computation is a single stream of instructions applied to the individual components of a data structure (such as an array). • Functional parallelism • Concurrency is expressed as a distinct set of functions that execute concurrently. This pattern may be used with imperative semantics, in which case the way the functions execute is defined in the source code (e.g., event based coordination). Alternatively, this pattern can be used with declarative semantics, such as within a functional language, where the functions are defined but how (or when) they execute is dictated by the interaction of the data with the language model.
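As one concrete instance of these patterns, a master-worker computation can be sketched with a shared task queue. One simplification is assumed: all tasks are queued before the workers start, so an empty queue serves as the termination condition and the master only has to wait for the workers to finish. The function and parameter names are illustrative.

```python
import queue
import threading


def master_worker(tasks, compute, num_workers=3):
    """Master fills a task queue; workers repeatedly grab a task and compute it."""
    task_queue = queue.Queue()
    results = {}
    lock = threading.Lock()          # protects the shared results dictionary

    for task in tasks:               # master sets up the task queue
        task_queue.put(task)

    def worker():
        while True:
            try:
                task = task_queue.get_nowait()   # grab the next task
            except queue.Empty:
                return               # termination condition: no tasks left
            value = compute(task)    # carry out the computation
            with lock:
                results[task] = value

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:                # master waits, then ends the computation
        w.join()
    return results


if __name__ == "__main__":
    print(master_worker(range(10), lambda x: x * x))
```

Because workers pull tasks when they are ready, the load balances itself even when tasks take different amounts of time.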
Example: File Search Problem • You have a collection of thousands of data-files. • Input: A string • Output: The number of times the input string appears in the collection of files. • Finding concurrency • Task Decomposition: The search of each file defines a distinct task. • Data Decomposition: Assign each file to a task • Dependencies: • A single group of tasks that can execute in any order … hence no partial orders to enforce • Data Dependencies: A single global counter for the number of times the string is found
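The decomposition above maps directly onto code: one task per file, and a single shared counter as the only data dependency. The sketch below uses a thread pool and a lock around the counter. The function name is hypothetical, and `str.count` counts non-overlapping occurrences, which is one possible reading of "number of times the string appears".

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_in_files(paths, needle, workers=8):
    """Count occurrences of `needle` across all files, one task per file."""
    total = 0
    lock = threading.Lock()          # guards the single global counter

    def search(path):
        nonlocal total
        found = Path(path).read_text(errors="ignore").count(needle)
        with lock:                   # the only point where tasks interact
            total += found

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks may run in any order; list() waits for them and surfaces errors.
        list(pool.map(search, paths))
    return total
```

A typical call would be `count_in_files(Path("data").glob("*.txt"), "needle")`, where the directory and pattern are placeholders.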
Example: A parallel game engine • Computer games represent an important yet VERY difficult class of parallel algorithm problems. • A computer game is: • physics simulations, AI, complex graphics, etc. • Real time programming … latencies must be less than 50 milliseconds and ideally MUCH lower (16 millisecs … to match the frame update rate for satisfactory graphics). • The computational core of a computer game is the Game-engine.
The heart of a game is the “Game Engine” • game state in its internally managed data structures. [Figure: game engine block diagram with Front End, network, Input, Sim, Animation, Render and Audio modules exchanging data/state and control, and assets loaded from media (disk, optical media)]
The heart of a game is the “Game Engine” • Sim is the integrator and integrates from one time step to the next, calling update methods for the other modules inside a central time loop. • game state in its internally managed data structures. [Figure: game engine block diagram with Front End, network, Input, Sim, Animation, Render and Audio modules exchanging data/state and control, and assets loaded from media (disk, optical media); inside Sim, a Time Loop drives the Particle System, Physics, Collision Detection and AI modules]
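The central time loop described for Sim can be sketched in a few lines. The class names, the fixed time step of 1/60 second, and the empty `update` bodies are assumptions made for illustration. A real engine would do actual work in each module.

```python
class Module:
    """Stand-in for an engine module such as Physics, AI or Collision Detection."""

    def __init__(self, name):
        self.name = name
        self.updates = 0

    def update(self, dt):
        self.updates += 1            # a real module would advance its state by dt


class Sim:
    """Integrator: advances game time and calls each module's update method."""

    def __init__(self, modules, dt=1 / 60):
        self.modules = list(modules)
        self.dt = dt
        self.time = 0.0

    def step(self):
        for module in self.modules:  # sequential here; candidates for parallelism
            module.update(self.dt)
        self.time += self.dt

    def run(self, steps):
        for _ in range(steps):       # the central time loop
            self.step()


if __name__ == "__main__":
    sim = Sim([Module("Particle System"), Module("Physics"),
               Module("Collision Detection"), Module("AI")])
    sim.run(3)
    print(sim.time, [m.updates for m in sim.modules])
```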
Finding concurrency: Specialist parallelism • Combine modules into groups, assign one group to a thread. • Asynchronous execution … interact through events. • Coarse grained parallelism dominated by flow of data between groups … Specialist parallelism strategy. [Figure: game engine block diagram with Front End, network, Input, Sim, Animation, Render and Audio modules exchanging data/state and control, and assets loaded from media (disk, optical media)]
Co-Array • real :: x(n)[p,*] • x(n): co-array (data) • [p,*]: co-dimension (images)
Co-Array Declaration & Memory • real :: x(n) - local array x of length n • real :: x(n)[*] - replicate an array x of length n to each image [Figure: elements x(1), x(2), x(3), …, x(n) laid out in the memory of image 0, image 1 and image 2]
Examples of Co-Array Declarations
• real :: a(n)[*] - replicate array a of length n to all images.
• integer :: z[p,*] - organize a logical two-dimensional grid p x (num_images()/p); replicate scalar z to each image.
• character :: b(n,m)[p,q,*] - organize a logical three-dimensional grid p x q x (num_images()/(p x q)); replicate two-dimensional array b of size n x m to each image.
• real, allocatable :: c(:)[:] - define allocatable co-array c.
• type(field) :: user_defined[*] - replicate user-defined structure to all images.
• integer :: local_x - define local variable local_x.
Co-Array Communication
• y(:) = x(:)[p,q] - copies array x from image (p,q) to local array y.
• x(index(:)) = y[index(:)] - gathers the value of y from all images listed in index and puts it into local array x.
• do i = 2, num_images()
      x(:) = x(:) + x(:)[i]
  end do - reduction on array x.
• p[:] = x - broadcast value x to all images.
• An absent co-dimension defaults to the local object.
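To make the semantics of the reduction and broadcast statements concrete, here is a small Python simulation. It is not Co-Array Fortran and runs no images in parallel: a list with one entry per image stands in for the co-dimension, so the Python expression `x[i - 1]` plays the role of `x(:)[i]`. All names are hypothetical.

```python
def make_coarray(num_images, n, fill):
    """One private n-element array per image; fill(image, k) gives the initial value."""
    return [[fill(image, k) for k in range(n)]
            for image in range(1, num_images + 1)]


def reduce_to_first(x):
    """Mimics, as executed on image 1:
    do i = 2, num_images(); x(:) = x(:) + x(:)[i]; end do
    """
    local = x[0]                       # image 1's own copy (absent co-dimension)
    for remote in x[1:]:               # images 2 .. num_images()
        for k, value in enumerate(remote):
            local[k] += value          # remote read, local accumulate
    return local


def broadcast(p, value):
    """Mimics p[:] = value: write value into every image's copy of scalar p."""
    for image in range(len(p)):
        p[image] = value


if __name__ == "__main__":
    x = make_coarray(3, 2, lambda image, k: image * 10 + k)
    print(reduce_to_first(x))
    p = [0, 0, 0]
    broadcast(p, 7)
    print(p)
```

Only image 1's copy of x is modified by the reduction. The copies on the other images are read but left unchanged.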