Parallel Programming. Yang Xianchun, Department of Computer Science and Technology, Nanjing University. Introduction
Agenda • About the Course • Evolution of Computing Technology • Grand Challenge Problem Examples • Motivations for Parallelism • Parallel Computation • A Road Map of Topics • Parallel Computing Terminology • An Illustrative Example • Control-Parallel Approach • Data-Parallel Approach • Performance Analysis
About the course • Description • An introductory course to parallel programming concepts and techniques • No prior background in parallel computation is necessary • Prerequisites • Working knowledge of computer systems • Adequate programming skill in C or C++ • Textbook • Harry Jordan and Gita Alaghband, “Fundamentals of Parallel Processing,” Prentice Hall, 2003.
About the course (cont.) • Topics • Introduction • Parallel architectures • Data parallel computing • Fortran 90 / 95 • Shared memory computing • Pthreads • OpenMP • Message passing programming • PVM • MPI • Applications • Sorting • Numerical algorithms • Graph algorithms • Interconnection networks • Performance • Models
About the course (cont.) • Organization • Lectures • Programming projects • using the parallel programming tools PVM and MPI on Linux / Unix machines • Homework • Class tests and final exam (open book)
Evolution of Computing Technology • Hardware • Vacuum tubes, relay memory • Discrete transistors, core memory • Integrated circuits, pipelined CPU • VLSI microprocessors, solid state memory • Languages and Software • Machine / Assembly languages • Algol / Fortran with compilers, batch processing OS • C, multiprocessing, timesharing OS • C++ / Java, parallelizing compilers, distributed OS
Evolution of Computing Technology (cont.) • The Driving Force Behind the Technology Advances • The ever-increasing demands on computing power • Scientific computing (e.g., large-scale simulations) • Commercial computing (e.g., databases) • 3D graphics and realistic animation • Multimedia internet applications
Grand Challenge Problem Examples • Simulations of the earth’s climate • Resolution: 10 kilometers • Period: 1 year • Ocean and biosphere models: simple • Total requirement: about 10^16 floating-point operations • With a supercomputer capable of 10 Giga FLOPS, it would take more than 10 days to execute • Real-time processing of 3D graphics • Number of data elements: 10^9 (1024 in each dimension) • Number of operations per element: 200 • Update rate: 30 times per second • Total requirement: 6.4 × 10^12 operations per second • With processors capable of 10 Giga IOPS, we need 640 of them
Motivations for Parallelism • Conventional computers are sequential • a single CPU • a single stream of instructions • executing one instruction at a time (not completely true) • A single-CPU computer has a performance limit • Moore’s Law can’t go on forever • How to increase computing power? • Better processor design • more transistors, larger caches, advanced architectures • Better system design • faster / larger memory, faster buses, better OS • Scale up the computer (parallelism) • replicate hardware at the component or whole-computer level • A parallel computer’s power is virtually unlimited • 10 processors @ 500 Mega FLOPS each = 5 Giga FLOPS • 100 processors @ 500 Mega FLOPS each = 50 Giga FLOPS • 1,000 processors @ 500 Mega FLOPS each = 500 Giga FLOPS • ...
Motivations for Parallelism(cont.) • Additional Motivations • Solving bigger problems • Lowering cost
Parallel Computation • Parallel computation means • multiple CPUs • single or multiple streams of instructions • executing multiple instructions at a time • Typical process • Breaking a problem into pieces and arranging for all pieces to be solved simultaneously on a multi-CPU computer system • Requirements • Parallel algorithms • only parallelizable applications can benefit from parallel implementation • Parallel languages and/or constructs • expressing parallelism • Parallel architectures • provide hardware support
A Road Map of Topics • Machine Models • PRAM • LogP • Parallel Architectures • vector / SIMD / MIMD • design issues • Programming Models • data parallel • shared memory • message passing • Parallel Algorithms • master-slave • divide-and-conquer • control vs. data parallelism • complexity analysis • Programming Languages • Fortran 90 / 95 • Pthreads, OpenMP • PVM, MPI • Applications • scientific computations • data-intensive problems • performance measurement
Parallel Computing Terminology (1) • Hardware • Multicomputers • tightly networked, multiple uniform computers • Multiprocessors • tightly networked, multiple uniform processors with additional memory units • Supercomputers • general purpose and high-performance, nowadays almost always parallel • Clusters • Loosely networked commodity computers
Parallel Computing Terminology (2) • Programming • Pipelining • divide a computation into stages (segments) • assign a separate functional unit to each stage • Data Parallelism • multiple (uniform) functional units • apply the same operation simultaneously to different elements of a data set • Control Parallelism • multiple (specialized) functional units • apply distinct operations to data elements concurrently
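To make the data-parallel / control-parallel distinction concrete, here is a minimal OpenMP sketch (OpenMP appears later in the course); the array size and the arithmetic are arbitrary choices for this illustration, not from the lecture:

```c
/* A minimal illustration of data parallelism (array size and the
   arithmetic are arbitrary choices for this sketch).
   Compile with: gcc -fopenmp example.c */
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N];
    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* Data parallelism: every thread applies the SAME operation
       to DIFFERENT elements of the data set. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + 1.0;

    /* Control parallelism would instead run DISTINCT operations
       concurrently, e.g. with "#pragma omp sections". */
    printf("a[%d] = %.1f\n", N - 1, a[N - 1]);
    return 0;
}
```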
Parallel Computing Terminology (3) • Performance • Throughput • number of results per unit time • Speedup • S = (time needed by the most efficient sequential algorithm) / (time needed on a pipelined or parallel machine) • for example, with the sieve timings later in this lecture, the two-processor control-parallel run gives S = 1,411 / 706 ≈ 2.0 • Scalability • An algorithm is scalable if the available parallelism increases at least linearly with problem size • An architecture is scalable if it gives the same performance per processor as the number of processors and the size of the problem are both increased • Data-parallel algorithms tend to be more scalable than control-parallel algorithms
An Illustrative Example • Problem • Find all primes less than or equal to some positive integer n • Method (the sieve algorithm) • Write down all the integers from 1 to n • Cross out from the list all multiples of 2, 3, 5, 7, …, up to sqrt(n)
An Illustrative Example (cont.) • Sequential Implementation • Boolean array representing the integers from 1 to n • Buffer for holding the current prime • Index for the loop iterating through the array
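A minimal sequential sketch in C of the implementation just outlined; the value n = 1000 and all identifiers are illustrative assumptions, not code from the lecture:

```c
/* Sequential sieve of Eratosthenes: Boolean array, a buffer for the
   current prime, and a loop index, as listed above. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long n = 1000;                    /* find all primes <= n (assumed) */
    char *marked = calloc(n + 1, 1);  /* Boolean array for 0..n */

    long prime = 2;                   /* buffer holding the current prime */
    while (prime * prime <= n) {
        /* loop index striking out multiples of the current prime */
        for (long i = prime * prime; i <= n; i += prime)
            marked[i] = 1;
        do { prime++; } while (marked[prime]);  /* next unmarked cell */
    }

    for (long i = 2; i <= n; i++)
        if (!marked[i]) printf("%ld ", i);
    putchar('\n');
    free(marked);
    return 0;
}
```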
An Illustrative Example (cont.) • Control-Parallel Approach • Different processors strike out multiples of different primes • The Boolean array and the current prime are shared; each processor has its own private copy of the loop index
An Illustrative Example (cont.) • Control-Parallel Approach (cont.) • Potential Problem: Race Conditions • Race 1: more than one processor may sieve multiples of the same prime • a processor reads the current prime, p, and goes off to sieve multiples of p; only later will it find a new prime and update the current prime buffer • before the current prime is updated, another processor comes along, reads p, and goes off to do the same thing • Race 2: a processor may sieve multiples of a composite number • processor A starts marking multiples of 2 • before it can mark any cells, processor B finds an unmarked cell, 3, and starts marking multiples of 3 • then processor C comes in, finds the next unmarked cell, 4, and starts marking multiples of 4 • These two race conditions do not cause incorrect results, but they do cause inefficiency
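A hedged Pthreads sketch of this control-parallel scheme (the thread count, identifiers, and the mutex are assumptions; the lecture's unsynchronized version is exactly what admits Race 1). The lock hands each candidate out once, so Race 1 is avoided here, but Race 2 can still occur and, as noted above, only wastes work:

```c
/* Control-parallel sieve: shared Boolean array and shared "current
   candidate"; each thread strikes multiples of the primes it claims. */
#include <pthread.h>
#include <stdio.h>

#define N 1000
#define NUM_THREADS 4

static char marked[N + 1];   /* shared: marked[i] != 0 => i is struck */
static long current = 1;     /* shared: last candidate handed out */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        long p;
        /* Claim the next unmarked candidate under a lock, so no two
           threads sieve the same value (avoids Race 1).  A candidate
           may still be a composite whose factor has not been marked
           yet (Race 2); that only duplicates strikes already implied. */
        pthread_mutex_lock(&lock);
        p = current + 1;
        while (p * p <= N && marked[p])
            p++;
        current = p;
        pthread_mutex_unlock(&lock);

        if (p * p > N)       /* all sieving values up to sqrt(N) done */
            break;
        for (long m = p * p; m <= N; m += p)  /* private loop index */
            marked[m] = 1;
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    for (long i = 2; i <= N; i++)
        if (!marked[i]) printf("%ld ", i);
    putchar('\n');
    return 0;
}
```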
An Illustrative Example (cont.) • Data-Parallel Approach • Each processor is responsible for a unique range of the integers and does all the striking in that range • Processor 1 is responsible for broadcasting its findings to the other processors • Potential Problem • If ⌊n/p⌋ < sqrt(n), more than one processor needs to broadcast its findings (see the MPI sketch below)
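A hedged MPI sketch of the data-parallel scheme (the block-decomposition formulas, identifiers, and the final reduction are assumptions for illustration). Process 0 can find and broadcast every sieving prime only if its block contains all integers up to sqrt(n), which is the ⌊n/p⌋ ≥ sqrt(n) condition above:

```c
/* Data-parallel sieve: each process strikes within its own block;
   process 0 finds each new prime and broadcasts it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int id, p;
    long n = 1000;   /* assumed problem size */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Block decomposition: process id owns the range [low, high]. */
    long low  = 2 + id * (n - 1) / p;
    long high = 1 + (long)(id + 1) * (n - 1) / p;
    long size = high - low + 1;
    char *marked = calloc(size, 1);

    /* Requires n/p >= sqrt(n), so that process 0's block holds all
       sieving primes; otherwise several processes must broadcast. */
    long prime = 2;
    while (prime * prime <= n) {
        /* Strike multiples of the current prime in the local block,
           starting no earlier than prime * prime. */
        long first = (low % prime == 0) ? low
                   : low + (prime - low % prime);
        if (first < prime * prime) first = prime * prime;
        for (long m = first; m <= high; m += prime)
            marked[m - low] = 1;

        /* Process 0 finds the next unmarked cell and broadcasts it. */
        if (id == 0) {
            do { prime++; } while (marked[prime - low]);
        }
        MPI_Bcast(&prime, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    }

    long local = 0, total;
    for (long i = 0; i < size; i++)
        if (!marked[i]) local++;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (id == 0) printf("%ld primes <= %ld\n", total, n);

    free(marked);
    MPI_Finalize();
    return 0;
}
```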
An Illustrative Example (cont.) • Performance Analysis • Sequential Algorithm • Cost of sieving multiples of 2: ⌊(n−3)/2⌋ • Cost of sieving multiples of 3: ⌊(n−8)/3⌋ • Cost of sieving multiples of 5: ⌊(n−24)/5⌋ • ... • For n=1,000, T=1,411 • Control-Parallel Algorithm • For p=2, n=1,000, T=706 • For p=3, n=1,000, T=499 • For p=4, n=1,000, T=499 (no improvement: the processor sieving the multiples of 2 is the bottleneck)
An Illustrative Example (cont.) • Performance Analysis (cont.) • Data-Parallel Algorithm • Cost of broadcasting: k(p−1)λ, where k is the number of primes up to sqrt(n) and λ is the time to communicate a prime between processors • Cost of striking: (⌊(n/p)/2⌋ + ⌊(n/p)/3⌋ + … + ⌊(n/p)/p_k⌋)χ, where p_k is the largest prime up to sqrt(n) and χ is the time per strike • For p=2, n=1,000, T ≈ 781 • For p=3, n=1,000, T ≈ 471 • For p=4, n=1,000, T ≈ 337