The Rise of Clusters and Parallel Computing in MATLAB: Overcoming Programming Challenges

MATLAB*P: Architecture Ron Choy, Alan Edelman Laboratory for Computer Science MIT

Outline • The “p” is for parallel • MATLAB is what people want

Supercomputing in 2003 • The impact of the Earth Simulator • The impact of the internet bubble • The few big players left • The rise of clusters

The Rise of Clusters • Cheap & everyone seems to have one

Some MIT Clusters • Cheap & everyone seems to have one

Rise of Clusters • One to Three Beowulfs is/are in the top 5 of the top500 list! • Increased computing power does not necessarily translate to increased productivity • Someone has to program these parallel machines, and this can be hard

Classes of interesting problems • Large collection of small problems – easy, usually involves running a large number of serial programs – embarrassingly parallel • Space of problems? • Large problems – might not even fit into the memory of a machine. Much harder since it usually involves communication between processes

Parallel programming (1990 – now) • Dominant model: MPI (Message Passing Interface) • Low level control of communications between processes • Send, Recv, Scatter, Gather …

MPI • It is not an easy way to do parallel programming. • Error prone – it is hard to get complex low level communications right • Hard to debug – MPI programs tend to die with obscure error messages. There is also no good debugger for MPI.

C with MPI #include <stdio.h> #include <stdlib.h> #include <math.h> #include "mpi.h" #define A(i,j) ( 1.0/((1.0*(i)+(j))*(1.0*(i)+(j)+1)/2 + (1.0*(i)+1)) ) void errorExit(void); double normalize(double* x, int mat_size); int main(int argc, char **argv) { int num_procs; int rank; int mat_size = 64000; int num_components; double *x = NULL; double *y_local = NULL; double norm_old = 1; double norm = 0; int i,j; int count; if (MPI_SUCCESS != MPI_Init(&argc, &argv)) exit(1); if (MPI_SUCCESS != MPI_Comm_size(MPI_COMM_WORLD,&num_procs)) errorExit();

C with MPI (2) if (0 == mat_size % num_procs) num_components = mat_size/num_procs; else num_components = (mat_size/num_procs + 1); mat_size = num_components * num_procs; if (0 == rank) printf("Matrix Size = %d\n", mat_size); if (0 == rank) printf("Num Components = %d\n", num_components); if (0 == rank) printf("Num Processes = %d\n", num_procs); x = (double*) malloc(mat_size * sizeof(double)); y_local = (double*) malloc(num_components * sizeof(double)); if ( (NULL == x) || (NULL == y_local) ) { free(x); free(y_local); errorExit(); } if (0 == rank) { for (i=0; i<mat_size; i++) { x[i] = rand(); } norm = normalize(x,mat_size); }

C with MPI (3) if (MPI_SUCCESS != MPI_Bcast(x, mat_size, MPI_DOUBLE, 0, MPI_COMM_WORLD)) errorExit(); count = 0; while (fabs(norm-norm_old) > TOL) { count++; norm_old = norm; for (i=0; i<num_components; i++) { y_local[i] = 0; } for (i=0; i<num_components && (i+num_components*rank)<mat_size; i++) { for (j=mat_size-1; j>=0; j--) { y_local[i] += A(i+rank*num_components,j) * x[j]; } } if (MPI_SUCCESS != MPI_Allgather(y_local, num_components, MPI_DOUBLE, x, num_components, MPI_DOUBLE, MPI_COMM_WORLD)) errorExit();

C with MPI (4) norm = normalize(x, mat_size); } if (0 == rank) { printf("result = %16.15e\n", norm); } free(x); free(y_local); MPI_Finalize(); exit(0); } void errorExit(void) { int rank; MPI_Comm_rank(MPI_COMM_WORLD,&rank); printf("%d died\n",rank); MPI_Finalize(); exit(1); }

C with MPI (5) double normalize(double* x, int mat_size) { int i; double norm = 0; for (i=mat_size-1; i>=0; i--) { norm += x[i] * x[i]; } norm = sqrt(norm); for (i=0; i<mat_size; i++) { x[i] /= norm; } return norm; }

The result? • A lot of researchers in the sciences need the power of parallel computing, but do not have the expertise to tackle the programming • Time saved by using parallel computers is offset by the additional time needed to program them

The result? (2) • A high level tool is needed to hide the complexity away from the users of parallel computers

Let’s go back to the serial worldfor a moment

Outline • The “p” is for parallel • MATLAB is what people want

What do people want? • Real applications programmers care less about “p-fold speedups” • Ease of programming, debugging, correct answer matters much more

MATLAB • MATrix LABoratory • Started out as an interactive frontend to LINPACK and EISPACK • Has since grown into a powerful computing environment with ‘Toolboxes’ ranging from neural networks to computational finance to image processing. • Many current users know nothing about linear algebra

MATLAB ‘language’ • MATLAB has a C-like scripting language, with HPF-like array indexing • Interpreted language • A wide range of linear algebra routines • Very popular in the scientific community as a prototyping tool – easier to use than C/FORTRAN

MATLAB ‘language’ (2) • MATLAB makes use of ATLAS (optimized BLAS) for basic linear algebra routines • Still, performance is not as good as C/FORTRAN, especially for loops • However MATLAB is catching up with its release 13 – (just-in-time compiling?)

Why people like MATLAB? • Convenience, not performance!

Wouldn’t it be nice to mixthe convenience of MATLABwithparallel computing?

Parallel MATLAB Survey • http://theory.lcs.mit.edu/~cly/survey.html • 27 parallel MATLAB projects found, using 4 approaches • Embarrassingly Parallel • Message Passing • Backend Support • Compiler

Embarrassingly Parallel

Embarrassingly Parallel (2) • Multiple MATLABs needed • Scatter/Gather at the beginning and end • No lateral communication

Message Passing

Message Passing (2) • Multiple MATLABs required • MATLABs talk to each other in a general MPI fashion • Superset of embarrassingly parallel

Backend support ScaLAPACK FFTW PBLAS PARPACK ……

Backend support • Requires 1 MATLAB • MATLAB is used as front end to a parallel computation engine

Compiler

Compiler (2) • Requires zero to one MATLAB • Compiles MATLAB scripts into native code and link with parallel numerical libraries

MATLAB*P • Project started by P. Husbands, who is now at NERSC • Goal: To provide a user-friendly tool for parallel computing • Uses MATLAB as a front end to a parallel linear algebra server • Encompasses the backend support, message passing and embarrassingly parallel approaches

MATLAB*P (2) • Server leverages on popular parallel numerical libraries: • ScaLAPACK • FFTW • PARPACK • …

Matlab*P A = rand(4000*p, 4000*p); x = randn(4000*p, 1); y = zeros(size(x)); while norm(x-y) / norm(x) > 1e-11 y = x; x = A*x; x = x / norm(x); end;

Structure of the System Package Manager Client Manager Proxy Client Matrix Manager Server Manager MATLAB

Structure of the system (2) • Client (C++) • Attaches to MATLAB through MEX interface • Relays commands to server and process returns • Server (C++/MPI) • Manipulates/processes data (matrices) • Performs the actual parallel computation by calling the appropriate library routines

Functionalities provided • PBLAS, ScaLAPACK • solve, eig, svd, chol, matmul, +, -, … • FFTW • 1D and 2D FFT • PARPACK • eigs, svds • Support provided for s,d,c,z

Focus • Requires minimal learning on user’s part • Reuses of existing scripts • Mimics MATLAB behaviour • Data stay on backend until explicitly retrieved by user • Extendable backend

Comparison

What is *p? • MATLAB*P creates a new MATLAB class called layoot • By providing overloaded functions for this new class, parallelism is achieved transparently • *p is layoot(1)

Example • A = randn(1024*p,1024*p); • E = eig(A); • e = pp2matlab(E); • plot(e,’*’);

Example 2 • H = hilb(n) % H = hilb(8000*p) • J = 1:n; • J = J(ones(n,1),:); • I = J’; • E = ones(n,n); • H = E./(I+J-1);

Built-in MATLAB functions • Hilb is a MATLAB function that get parallelized automatically • Another example of a MATLAB function that works is cgs • We do not know how many of them works

Where is the data? • One often-asked question in parallel computing is: where are the data residing? • MATLAB*P supports 3 data distributions: • A = randn(m*p,n) – row distribution • B = randn(m,n*p) – column distribution • C = randn(m*p,n*p) – 2D block cyclic distribution

Flow of a MATLAB*P call • User types A = randn(256*p) in MATLAB • The overloaded randn in layoot class is called • The overloaded method calls MATLAB*P client with the arguments ‘ppbase_addDense’, 256, 3 (for cyclic dist) • Client relays the command to server

The Rise of Clusters and Parallel Computing in MATLAB: Overcoming Programming Challenges