Design and analysis of algorithms for multicore architectures
Alejandro Salinger
April 2nd, 2009
Joint work with Alex López-Ortiz and Reza Dorrigiv
Outline
• Models of computation
• Motivation
• Parallelism in multicore
• Low Degree PRAM (LoPRAM)
• Work-optimal algorithms
• Divide & conquer
• Dynamic programming
• Related Work
• Conclusions
Design and Analysis of Algorithms for Multicore Architectures - Alejandro Salinger
Abstract Modeling
• Capture the characteristics of a phenomenon with an adequate degree of accuracy, in order to facilitate analysis and prediction [Maggs et al.].
• Examples: financial markets, weather forecasting, particle motion, genetic information, etc.
• Several models may describe the same system or phenomenon.
• Trade-off between simplicity and accuracy.
Theoretical Models of Computation
[Figure: two images contrasted ("vs"); not preserved]
Random Access Machine (RAM)
[Diagram labels: CPU, MEM, I/O]
• Models von Neumann's architecture.
• A program executes over an infinite array of registers.
• Random access.
Random Access Machine (RAM)
• Simple operation = 1 unit of time.
• Memory access = 1 unit of time.
• The model captures the essence of a computer.
• Simple.
• Useful in practice.
Parallelism is here!
• Why multicore processors?
  • Sequential programming is too easy.
  • We love doing things in parallel.
  • We finally know how to effectively program in parallel.
  • None of the above. [~Valiant, SPAA08]
There is no other way to make processors faster.
CPU Frequencies [Hennessy 06]
[Figure not preserved]
The Walls
• Power Wall: high power consumption.
  • Power ~ C·V²·f.
  • Simpler processors enjoy more MIPS per watt.
• ILP Wall: little instruction-level parallelism.
  • Branch prediction, speculation, out-of-order issue, register renaming, etc.
  • Not effective for control-dependent computations or data-dependent memory addressing.
• Memory Wall: memory latency.
  • Memory and cache speeds do not match processor speed.
  • Communication bottleneck.
[Patterson, Smith]
AMD Opteron Quad Core
[Figure not preserved]
Multicore Architectures
Predominant model in practice.
“64 to 128 core per microprocessor by 2015” [Intel roadmap]
“the next version of Moore’s law” [Steve Scott, CTO, Cray]
Parallel Models of Computation
PRAM (Parallel Random Access Machine)
• p synchronous processors.
• Multiple-Instruction Multiple-Data (MIMD).
• Communication through shared memory.
• Unit-cost operations: read, compute, write.
[Diagram label: Memory]
PRAM
• Traditionally, algorithms assume Θ(n) processors.
• What if only p < n processors are available?
• Simulate the Θ(n)-processor solution using Brent’s lemma: an algorithm that takes time T on m processors can be run on p processors in time O(T·m/p), for m > p.
• Recall, optimal speedup: Tp(n) = Θ(T1(n)/p).
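As a rough illustration of the simulation behind Brent's lemma, the sketch below counts the time steps needed when p physical processors emulate m virtual ones in rounds. The function name `simulated_steps` and its parameters are assumptions made for this example; they do not come from the slides.

```c
#include <assert.h>

/* Hypothetical helper: number of time steps needed to simulate `steps`
 * synchronous PRAM steps of an m-processor algorithm on p processors.
 * Each virtual step is emulated in ceil(m / p) rounds, with every
 * physical processor handling at most one virtual processor per round. */
long simulated_steps(long steps, long m, long p)
{
    assert(steps >= 0 && m >= 1 && p >= 1);
    long rounds_per_step = (m + p - 1) / p;   /* ceil(m / p) */
    return steps * rounds_per_step;
}
```

Under this accounting the slowdown is a factor of about m/p, so total work is preserved and speedup stays optimal whenever the m-processor algorithm was work-optimal to begin with.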
PRAM (cont.)
• Facilitates analysis and design, however:
  • Memory accesses and local operations have different costs.
  • Memory has limited bandwidth.
  • Processes are not synchronous.
  • Difficult to derive algorithms that take full advantage of Θ(n) processors.
Multicore Parallelism
• Small number of cores: low-degree parallelism.
• High-level thread-based control of parallelism.
• Shared memory, private and shared caches.
What to do with these cores?
“Programmability has now replaced power as the number one impediment to the continuation of Moore’s law.” [Gartner]
• Before, applications took advantage of processor advances.
• Now, we need to take advantage of parallelism.
“Before, parallel programming meant high performance. Now it means everyday applications in laptops. The goal, the class of programmer and expectations are different.” [Andrew Chin, Intel]
The Multicore Challenge
Design a model that:
• Reflects the available degree of parallelism.
• Is multi-threaded.
• Degrades gracefully when fewer processors are available than originally assumed (dynamic).
• Allows easy theoretical analysis.
• Is easy to program.
Low Degree Parallelism
• Number of cores is growing.
• Not a constant.
• Modeled as ~log n.
• Similar to bit-level parallelism:
  • Considered a constant when the word was 4 or 8 bits.
  • Now described as ~log n in the word RAM model.
Low Degree PRAM (LoPRAM)
• PRAM with O(log n) processors.
• Multiple-Instruction Multiple-Data (MIMD).
• CREW: Concurrent-Read Exclusive-Write.
• High-level thread-based parallelism (asynchronous).
• Communication through shared memory.
• Semaphores and automatic serialization are available and transparent to the programmer.
• p = O(log n), but not necessarily p = Θ(log n).
Threads in the LoPRAM
• PAL threads: Parallel ALgorithmic threads.
• A pal-thread call is a request for the creation of a thread.
• The requesting thread issues its requests in batch mode and suspends execution until they are done.
• If cores are available, the pal-thread is activated and becomes a conventional thread.
• Otherwise, the request is added to a tree of requests, under the node corresponding to the calling thread.
[Figure legend: pending, active, blocked]
Threads in the LoPRAM: PAL threads (cont.)
• When a thread blocks, its core is assigned to the first pending child.
• If there are available cores, pending requests are activated in the order given by a breadth-first traversal of the tree.
• When there are no pending children, control goes back to the parent.
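The breadth-first activation rule can be sketched as follows. This is only an illustration under assumed data structures: the type `pal_node`, the state names, the fixed limits, and the function `activate_pending` are all hypothetical and are not defined in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states, mirroring the pending/active/blocked legend. */
enum pal_state { PAL_PENDING, PAL_ACTIVE, PAL_BLOCKED };

#define MAX_CHILDREN 4
#define MAX_NODES 64

/* Hypothetical node of the tree of pal-thread requests. */
struct pal_node {
    enum pal_state state;
    int n_children;
    struct pal_node *children[MAX_CHILDREN];
};

/* Activate up to free_cores pending requests, visiting the request tree
 * in breadth-first order. Returns the number of requests activated. */
int activate_pending(struct pal_node *root, int free_cores)
{
    struct pal_node *queue[MAX_NODES];
    int head = 0, tail = 0, activated = 0;

    if (root == NULL)
        return 0;
    queue[tail++] = root;
    while (head < tail && activated < free_cores) {
        struct pal_node *node = queue[head++];
        if (node->state == PAL_PENDING) {
            node->state = PAL_ACTIVE;   /* becomes a conventional thread */
            activated++;
        }
        /* Enqueue children so that shallower requests are served first. */
        for (int i = 0; i < node->n_children && tail < MAX_NODES; i++)
            queue[tail++] = node->children[i];
    }
    return activated;
}
```

A real scheduler would also handle thread completion and returning control to the parent; the sketch covers only the activation order.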
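Example: MergeSort
    /* Forward declarations; merge() is not shown on the slide. */
    void m_sort(int numbers[], int temp[], int left, int right);
    void merge(int numbers[], int temp[], int left, int mid, int right);

    void mergeSort(int numbers[], int temp[], int array_size) {
        m_sort(numbers, temp, 0, array_size - 1);
    }

    /* Sort numbers[left..right], using temp as scratch space. */
    void m_sort(int numbers[], int temp[], int left, int right) {
        int mid = left + (right - left) / 2;  /* avoids overflow of right + left */
        if (right > left) {
            m_sort(numbers, temp, left, mid);
            m_sort(numbers, temp, mid + 1, right);
            merge(numbers, temp, left, mid + 1, right);
        }
    }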
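Example: MergeSort (with pal-threads)
    /* Forward declarations; merge() is not shown on the slide. */
    void m_sort(int numbers[], int temp[], int left, int right);
    void merge(int numbers[], int temp[], int left, int mid, int right);

    void mergeSort(int numbers[], int temp[], int array_size) {
        m_sort(numbers, temp, 0, array_size - 1);
    }

    /* Sort numbers[left..right], using temp as scratch space. */
    void m_sort(int numbers[], int temp[], int left, int right) {
        int mid = left + (right - left) / 2;  /* avoids overflow of right + left */
        if (right > left) {
            palthreads {  /* LoPRAM construct, not standard C: do in parallel if possible */
                m_sort(numbers, temp, left, mid);
                m_sort(numbers, temp, mid + 1, right);
            }  /* implicit join */
            merge(numbers, temp, left, mid + 1, right);
        }
    }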
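The slides call `merge` without showing it. A minimal sequential sketch that matches the call `merge(numbers, temp, left, mid+1, right)` is given below, assuming the fourth argument is the first index of the right-hand run and that `temp` is at least as long as `numbers`. The `palthreads` block is omitted here because it is a LoPRAM construct rather than standard C.

```c
#include <assert.h>

/* Merge the sorted runs numbers[left..mid-1] and numbers[mid..right]
 * through the scratch array temp, then copy the result back. */
void merge(int numbers[], int temp[], int left, int mid, int right)
{
    int i = left, j = mid, k = left;

    assert(left <= mid && mid <= right + 1);
    while (i < mid && j <= right)
        temp[k++] = (numbers[i] <= numbers[j]) ? numbers[i++] : numbers[j++];
    while (i < mid)
        temp[k++] = numbers[i++];
    while (j <= right)
        temp[k++] = numbers[j++];
    for (k = left; k <= right; k++)
        numbers[k] = temp[k];
}

/* Sort numbers[left..right] recursively (sequential version). */
void m_sort(int numbers[], int temp[], int left, int right)
{
    if (right > left) {
        int mid = left + (right - left) / 2;
        m_sort(numbers, temp, left, mid);
        m_sort(numbers, temp, mid + 1, right);
        merge(numbers, temp, left, mid + 1, right);
    }
}

void mergeSort(int numbers[], int temp[], int array_size)
{
    m_sort(numbers, temp, 0, array_size - 1);
}
```

Taking from the left run on ties (`<=`) keeps the sort stable.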
Order of Execution
[Figure: mergesort with n = 16 and p = 4; legend: pending, active, blocked]
Analysis
The first term dominates so long as p = O(log n).
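The running-time expression this slide refers to is not reproduced in the text. A plausible reconstruction, assuming sequential merging and p a power of two, is the following: the top log p levels of the recursion run in parallel, each core then sorts n/p elements sequentially, and the merges on the way back up cost n/2^i at level i.

```latex
\begin{align*}
T_p(n) &= T_1\!\left(\frac{n}{p}\right)
          + \sum_{i=0}^{\log_2 p - 1} O\!\left(\frac{n}{2^i}\right) \\
       &= O\!\left(\frac{n}{p}\log\frac{n}{p}\right) + O(n)
        = O\!\left(\frac{n\log n}{p} + n\right).
\end{align*}
```

Under this reading, the first term is at least as large as the second exactly when p ≤ log n, which is consistent with the statement above.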
In general: Divide & Conquer
• Recursive divide-and-conquer algorithms with running time given by a recurrence of the form T(n) = a·T(n/b) + f(n).
• By the master theorem:
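The formulas on this slide are not reproduced in the text. For reference, the standard statement of the master theorem for T(n) = a T(n/b) + f(n), with constants a ≥ 1 and b > 1, is shown here in its textbook form, which may differ in notation from the original slide.

```latex
\[
T(n) =
\begin{cases}
\Theta\!\left(n^{\log_b a}\right)
  & \text{if } f(n) = O\!\left(n^{\log_b a - \varepsilon}\right)
    \text{ for some } \varepsilon > 0, \\[4pt]
\Theta\!\left(n^{\log_b a}\log n\right)
  & \text{if } f(n) = \Theta\!\left(n^{\log_b a}\right), \\[4pt]
\Theta\!\left(f(n)\right)
  & \text{if } f(n) = \Omega\!\left(n^{\log_b a + \varepsilon}\right)
    \text{ for some } \varepsilon > 0 \\
  & \text{and } a f(n/b) \le c f(n) \text{ for some } c < 1
    \text{ and all large } n.
\end{cases}
\]
```

Mergesort (a = 2, b = 2, f(n) = O(n)) falls under the second case, and Strassen-style matrix multiplication (a = 7, b = 2, f(n) = O(n^2)) under the first.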
Divide & Conquer
[Figure not preserved]
Cases 1 and 2
Tp(n) = Θ(T(n)/p) holds so long as p = O(log n).
Case 3
i) Sequential merging: Tp(n) = Θ(f(n))
ii) Parallel merging: Tp(n) = Θ(f(n)/p)
Master Theorem for LoPRAM
• Divide-and-conquer algorithms whose recurrences can be solved with the master theorem achieve optimal speedup so long as the number of processors is at most log n, i.e., Tp(n) = T1(n)/p.
LoPRAM
• Optimal speedup so long as p ≤ √n or p ≤ log n, depending on the cost of the merging phase.
• The p ≤ √n barrier was observed for certain P-complete problems in the PRAM [Kruskal et al. ’90].
• The p ≤ log n barrier was observed in heaps [Munro and Robertson ’79].
• With these bounds, communication between processors is practical with a complete network.
Mergesort
T(n) = 2T(n/2) + O(n)
T(n) = O(n log n)
Tp(n) = O((n log n)/p)
Matrix Multiplication
T(n) = 7T(n/2) + O(n^2)
T(n) = O(n^log2(7)) ≈ O(n^2.81)
Tp(n) = O(n^2.81 / p)
LoPRAM
• Divide-and-conquer algorithms are easy to parallelize.
• Not so easy when p is Θ(n) (e.g. mergesort).
• Other examples?
• Yes, the approach extends to dynamic programming, where it also achieves optimal speedup.
Dynamic Programming
• Generic parallel algorithms for problems solvable by dynamic programming.
• Given a dynamic programming solution, determine the corresponding Directed Acyclic Graph (DAG) and execute it in parallel.
• Speedup depends on the degree of parallelism of the DAG.
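One way to picture executing a dynamic-programming DAG with p cores is a round-based simulation: in each round, up to p nodes whose dependencies are all satisfied are run, and their dependents are then released. The sketch below is an assumption-laden illustration, not the scheduler from the talk; `struct dag`, `schedule_rounds`, the adjacency-matrix representation, and the unit cost per node are all hypothetical.

```c
#include <assert.h>

#define MAX_NODES 64

/* Hypothetical DAG representation: edge[u][v] != 0 means v depends on u. */
struct dag {
    int n;
    unsigned char edge[MAX_NODES][MAX_NODES];
};

/* Simulate running the DAG on p cores, one unit-cost node per core per
 * round. Returns the number of rounds, or -1 if the graph has a cycle. */
int schedule_rounds(const struct dag *g, int p)
{
    int indegree[MAX_NODES] = { 0 };
    int done[MAX_NODES] = { 0 };
    int remaining = g->n, rounds = 0;

    assert(p >= 1 && g->n >= 0 && g->n <= MAX_NODES);
    for (int u = 0; u < g->n; u++)
        for (int v = 0; v < g->n; v++)
            if (g->edge[u][v])
                indegree[v]++;

    while (remaining > 0) {
        int batch[MAX_NODES], size = 0;

        /* Pick up to p ready nodes (all dependencies already computed). */
        for (int v = 0; v < g->n && size < p; v++)
            if (!done[v] && indegree[v] == 0)
                batch[size++] = v;
        if (size == 0)
            return -1;                  /* nothing ready: cycle detected */

        /* Run the batch "in parallel", then release its dependents. */
        for (int i = 0; i < size; i++) {
            int u = batch[i];
            done[u] = 1;
            remaining--;
            for (int v = 0; v < g->n; v++)
                if (g->edge[u][v])
                    indegree[v]--;
        }
        rounds++;
    }
    return rounds;
}
```

For a 3×3 table in which each cell depends on its upper and left neighbors, the anti-diagonals have sizes 1, 2, 3, 2, 1, so three cores finish in 5 rounds while a single core needs 9.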
Dynamic Programming
[Figures not preserved: example DAG with numeric node labels, assuming p = 3]
Related Work
Bulk Synchronous Parallel (BSP) model [Valiant]:
• Processors with local memory, router for point-to-point messages, periodic synchronization.
• Synchronization at most every L steps: cost for synchronization and communication latency.
• g local operations per memory access: bandwidth limitations.
• Incentive for latency hiding.
Related Work
BSP*: BSP for multicores [Valiant]
• d levels (p_i, L_i, m_i, g_i):
  • p_i: number of components.
  • L_i: synchronization cost.
  • m_i: size of memory.
  • g_i: data rate.
• Level 0: cores.
• Portable algorithms.
• “Immortal algorithms.”
[Diagram labels: level j-1, level j, m_j, g_j]
Related Work
Cilk: programming platform for multithreaded computations with provable performance
• Parallel computation modeled as a DAG of tasks.
• Scheduler: work-stealing.
• Processing time, space, and communication optimal to within constant factors for “fully strict” multithreaded computations.
Cache efficiency
• Cost of memory access >> local operation.
• Cost of memory access determined by cache miss or hit, not by routing cost, latency or gap (as in BSP or LogP models).
• How to schedule for cache performance?
  • Private caches: processors working on different data (e.g. work stealing).
  • Shared cache: processors working on same data (e.g. Parallel Depth First).
Related Work: Cache [Blelloch et al. SODA08]
• Multicore-cache model (L1 private, L2 shared).
• Controlled-PDF scheduler.
• Cache efficiency within a constant of sequential complexity for L1 and L2 caches for a broad class of divide and conquer algorithms.
• Divide and conquer algorithms consider parallel merging.
Related Work: Cache [Chowdhury and Ramachandran, SPAA08]
• Generic cache-efficient dynamic programming algorithms.
• 3 models of caches:
  • Private caches for each core.
  • Shared cache for all cores.
  • Multicore: private L1 and shared L2.
• Algorithms for each of 3 types of problems:
  • Local dependency (e.g. longest common subsequence).
  • Gaussian Elimination Paradigm (e.g. LU decomposition).
  • Parenthesis problem (e.g. matrix chain multiplication).
• Cache-efficient execution up to the critical path length of the algorithm, T∞(n) = Θ(n).
Conclusions
• Microprocessor development has shifted the paradigm from higher frequencies to multicore.
• This scenario calls for a new approach in theoretical models of computation.
• We introduced a new model that
  • is faithful to current architectures,
  • avoids the pitfalls of the PRAM,
  • is theoretically simple,
  • allows for significant classes of algorithms to be parallelized with little effort.
Future Work
• Extend optimal parallelization to more general classes of divide and conquer algorithms and other types of problems.
• Consider cache efficiency for different cache models, maybe for other types of problems.
• Determine the barrier in the number of processors for optimal parallelization for other classes of problems.