Optimization of Java-Like Languages for Parallel and Distributed Environments

Kathy Yelick, U.C. Berkeley Computer Science Division, http://www.cs.berkeley.edu/~yelick/talks.html

What this tutorial is about • Language and compiler support for: • Performance • Programmability • Scalability • Portability • Some of this is specific to the Java language (not the JVM), but much of it applies to other parallel languages

Titanium Titanium will be used as an example • Based on Java • Has Java’s syntax, safety, memory management, etc. • Replaces Java’s thread model with static threads (SPMD) • Other extensions for performance and parallelism • Optimizing compiler • Compiles to C (and from there to executable) • Synchronization analysis • Various optimizations • Portable • Runs on uniprocessors, shared memory, and clusters

Organization • Can we use Java for high performance on • 1 processor machines? • Java commercial compilers on some Scientific applications • Java the language, compiled to native code (via C) • Extensions of Java to improve performance • 10-100 processor machines? • 1K-10K processor machines? • 100K-1M processor machines?

SciMark Benchmark • Numerical benchmark for Java, C/C++ • Five kernels: • FFT (complex, 1D) • Successive Over-Relaxation (SOR) • Monte Carlo integration (MC) • Sparse matrix multiply • dense LU factorization • Results are reported in Mflops • Download and run on your machine from: • http://math.nist.gov/scimark2 • C and Java sources also provided Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Sun UltraSPARC 60) * Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7 Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

SciMark: Java vs. C (Intel PIII 500MHz, Win98) * Sun JDK 1.2, javac -O; Microsoft VC++ 5.0, cl -O; Win98 Roldan Pozo, NIST, http://math.nist.gov/~Rpozo

Can we do better without the JVM? • Pure Java with a JVM (and JIT) • Within 2x of C and sometimes better • OK for many users, even those using high end machines • Depends on quality of both compilers • We can try to do better using a traditional compilation model • E.g., Titanium compiler at Berkeley • Compiles Java extension to C • Does not optimize Java arrays or for loops (prototype)

Java Compiled by Titanium Compiler

Language Support for Performance • Multidimensional arrays • Contiguous storage • Support for sub-array operations without copying • Support for small objects • E.g., complex numbers • Called “immutables” in Titanium • Sometimes called “value” classes • Unordered loop construct • Programmer specifies that the iterations are independent • Eliminates need for dependence analysis (short term solution?). Used by vectorizing compilers.
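The "small objects" bullet can be illustrated in plain Java. The sketch below is not Titanium's "immutable" feature; it only shows the value-class style it targets, using a hypothetical Complex class whose fields never change after construction. In standard Java each instance is still heap-allocated, which is the overhead the language extension is meant to avoid.

```java
// Hypothetical value-style class in plain Java (an assumption for illustration,
// not Titanium's "immutable" keyword).
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Returns a new value; neither operand is modified.
    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    public static void main(String[] args) {
        Complex p = new Complex(1, 2).times(new Complex(3, 4));
        System.out.println(p.re + " + " + p.im + "i");
    }
}
```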

HPJ Compiler from IBM • HPJ Compiler from IBM Research • Moreira et al. • Program using Array classes which use contiguous storage • e.g. A[i][j] becomes A.get(i,j) • No new syntax (worse for programming, but better portability: any Java compiler can be used) • Compiler for IBM machines, exploits hardware • e.g., Fused Multiply-Add • Result: 85+% of Fortran on RS/6000
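A minimal sketch of the "Array class with contiguous storage" idea follows. The class name Dense2D and its methods are assumptions made for illustration; this is not the IBM Array package API. The point is only that A.get(i,j) can index one flat row-major array instead of an array of row objects.

```java
// Hypothetical 2D array class backed by one contiguous block (not the IBM API).
final class Dense2D {
    private final double[] data;   // row-major: element (i, j) lives at i * cols + j
    private final int rows;
    private final int cols;

    Dense2D(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
        this.data = new double[rows * cols];
    }

    double get(int i, int j) {
        return data[i * cols + j];
    }

    void set(int i, int j, double v) {
        data[i * cols + j] = v;
    }

    // Matrix-vector product written against the get(i, j) interface.
    static double[] multiply(Dense2D a, double[] x) {
        double[] y = new double[a.rows];
        for (int i = 0; i < a.rows; i++) {
            double sum = 0.0;
            for (int j = 0; j < a.cols; j++) {
                sum += a.get(i, j) * x[j];
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        Dense2D a = new Dense2D(2, 2);
        a.set(0, 0, 1); a.set(0, 1, 2);
        a.set(1, 0, 3); a.set(1, 1, 4);
        double[] y = multiply(a, new double[] {1, 1});
        System.out.println(y[0] + " " + y[1]);
    }
}
```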

Java vs. Fortran Performance *IBM RS/6000 67MHz POWER2 (266 Mflops peak) AIX Fortran, HPJC

Organization • Can we use Java for high performance on • 1 processor machines? • 10-100 processor machines? • A correctness model • Cycle detection for reordering analysis • Synchronization analysis • 1K-10K processor machines? • 100K-1M processor machines?

Java Parallel Programming Parallel programming models and languages are distinguished primarily by: • How parallel processes/threads are created • Statically at program startup time • The SPMD model, 1 thread per processor • Dynamically during program execution • Through fork statements or other features • How the parallel threads communicate • Through message passing (send/receive) • By reading and writing to shared memory Implicit parallelism not included here

Two Problems • Compiler writers would like to move code around • The hardware folks also want to build hardware that dynamically moves operations around • When is reordering correct? • Because the programs are parallel, there are more restrictions, not fewer • The reason is that we have to preserve semantics of what may be viewed by other processors

Sequential Consistency • Given a set of executions from n processors, each defines a total order Pi. • The program order is the partial order given by the union of these Pi’s. • The overall execution is sequentially consistent if there exists a correct total order that is consistent with the program order. • When this is serialized, the read and write semantics must be preserved. Example accesses: write x = 1; read y → 0; write y = 3; read z → 2; read x → 1; read y → 3

Sequential Consistency Intuition • Sequential consistency says that: • The compiler may only reorder operations if another processor cannot observe it. • Writes (to variables that are later read) cannot result in garbage values being written. • The program behaves as if processors take turns executing instructions • Comments: • In a legal execution, there will typically be many possible total orders, limited only by the reads and writes to shared variables • This is what you get if all reads and writes go to a single shared memory, and accesses are serialized at each memory cell
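The "processors take turns" intuition can be turned into a brute-force check. The sketch below is only an illustration: the class ScCheck and its helpers are hypothetical names, all variables are assumed to start at 0, and the search is exponential, so it is suitable only for tiny examples. It tries every interleaving that respects each processor's program order and reports whether some interleaving explains all the observed read values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical brute-force checker: is an observed execution sequentially consistent?
final class ScCheck {
    // One access: a write of value to var, or a read of var that observed value.
    static final class Op {
        final boolean isWrite;
        final String var;
        final int value;
        Op(boolean isWrite, String var, int value) {
            this.isWrite = isWrite;
            this.var = var;
            this.value = value;
        }
    }

    static Op write(String var, int value) { return new Op(true, var, value); }
    static Op read(String var, int observed) { return new Op(false, var, observed); }

    // procs[p] is processor p's accesses in program order.
    static boolean isSequentiallyConsistent(Op[][] procs) {
        return search(procs, new int[procs.length], new HashMap<>());
    }

    private static boolean search(Op[][] procs, int[] next, Map<String, Integer> mem) {
        boolean allDone = true;
        for (int p = 0; p < procs.length; p++) {
            if (next[p] == procs[p].length) continue;
            allDone = false;
            Op op = procs[p][next[p]];
            if (op.isWrite) {
                Integer old = mem.put(op.var, op.value);   // apply the write
                next[p]++;
                boolean ok = search(procs, next, mem);
                next[p]--;
                if (old == null) mem.remove(op.var); else mem.put(op.var, old);  // undo
                if (ok) return true;
            } else if (mem.getOrDefault(op.var, 0) == op.value) {
                next[p]++;                                 // read matches memory: take it
                boolean ok = search(procs, next, mem);
                next[p]--;
                if (ok) return true;
            }
        }
        return allDone;   // true only when every processor has finished
    }

    public static void main(String[] args) {
        Op[][] bad = {
            { write("data", 1), write("flag", 1) },
            { read("flag", 1), read("data", 0) }
        };
        System.out.println(isSequentiallyConsistent(bad));
    }
}
```

In the main method, the second processor sees the new flag but the old data; no turn-taking order can produce that, so the checker rejects it.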

How Can Sequential Consistency Fail? • The compiler saves a value in a register across multiple read accesses • This “moves” the later reads to the point of the first one • The compiler saves a value in a register across writes • This “moves” the write until the register is written back from the standpoint of other processors. • The compiler performs common subexpression elimination • As if the later expression reads are all moved to the first • Once contiguous in the instruction stream, they are merged • The compiler performs other code motion • The hardware has a write buffer • Reads may by-pass writes in the buffer (to/from different variables) • Some write buffers are not FIFO • The hardware may have out-of-order execution

Weaker Correctness Models • Many systems use weaker memory models: • Sun has TSO, PSO, and RMO • Alpha has its own model • Some languages do as well • Java also has its own, currently undergoing redesign • C spec is mostly silent on threads – very weak on memory mapped I/O • These are variants on the following, sequential consistency under proper synchronization: • All accesses to shared data must be protected by a lock, which must be a primitive known to the system • Otherwise, all bets are off (extreme)

Why Don’t Programmers Care? • If these popular languages have used these weak models successfully, then what is wrong? • They don’t worry about what they don’t understand • Many people use compilers that are not very aggressive about reordering • The hardware reordering is non-deterministic, and may happen very infrequently in practice • Architecture community is way ahead of us in worrying about these problems. • Open problem: A hardware simulator and/or Java (or C) compiler that reorders things in the “worst possible way”

Using Software to Mask Hardware • Recall our two problems: • Compiler writers would like to move code around • The hardware folks also want to build hardware that dynamically moves operations around • The second can be viewed as a compiler problem • Weak memory models come with extra primitives, usually called fences or memory barriers • Write fence: wait for all outstanding writes from this processor to complete • Read fence: do not issue any read pre-fetches before this point

Use of Memory Fences • Memory fences can turn a particular memory model into sequential consistency under proper synchronization: • Add a read-fence to acquire lock operation • Add a write fence to release lock operation • In general, a language can have a stronger model than the machine it runs on if the compiler is clever • The language may also have a weaker model, if the compiler does any optimizations

Aside: Volatile • Because Java and C have weak memory models at the language level, they give programmers a tool: volatile variables • These variables should not be kept in registers • Operations should not be reordered • Should have mem fences around accesses • General problem • This is a big hammer which may be unnecessary • No fine-grained control over particular accesses or program phases (static notion) • To get SC using volatile, many variables must be volatile
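As a hedged illustration of the volatile "hammer", the sketch below writes a data/flag handoff in plain Java with only the flag declared volatile. It relies on the Java memory model as revised in Java 5 and later, where a volatile write followed by a volatile read of the same field orders the surrounding ordinary accesses. The class and method names are assumptions for the example.

```java
// Hypothetical data/flag handoff; only the flag is volatile.
final class VolatileHandoff {
    static int data = 0;                   // ordinary shared field
    static volatile boolean flag = false;  // not kept in a register, not reordered

    // Runs Proc A and Proc B once and returns the value of data that Proc B read.
    static int runOnce() {
        data = 0;
        flag = false;
        final int[] seen = new int[1];
        Thread procB = new Thread(() -> {
            while (!flag) { /* spin until Proc A sets the flag */ }
            seen[0] = data;
        });
        Thread procA = new Thread(() -> {
            data = 1;
            flag = true;
        });
        procB.start();
        procA.start();
        try {
            procA.join();
            procB.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("Proc B read data = " + runOnce());
    }
}
```

Without the volatile qualifier the spin loop may never terminate, because the compiler is then free to keep flag in a register. With it, Proc B is guaranteed to read data = 1.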

How Can Compilers Help? • To implement a stronger model on a weaker one: • Figure out what can legally be reordered • Do optimizations under these constraints • Generate necessary fences in resulting code • Open problem: Can this be used to give Java a sequentially consistent semantics? • What about C?

Compiler Analysis Overview • When compiling sequential programs, compute dependencies:
  x = expr1; y = expr2;   may become   y = expr2; x = expr1;
  Valid if y not in expr1 and x not in expr2 (roughly)
• When compiling parallel code, we need to consider accesses by other processors.
  Initially flag = data = 0
  Proc A: data = 1; flag = 1;
  Proc B: while (flag == 0); ... = ...data...;

Cycle Detection • Processors define a “program order” on accesses from the same thread; P is the union of these total orders • The memory system defines an “access order” on accesses to the same variable; A is the access order (read/write & write/write pairs) • A violation of sequential consistency is a cycle in P ∪ A [Shasha & Snir] (Figure: write data and write flag on one processor, read flag and read data on the other)

Cycle Analysis Intuition • Definition is based on execution model, which allows you to answer the question: Was this execution sequentially consistent? • Intuition: • Time cannot flow backwards • Need to be able to construct total order • Examples (all variables initially 0) write data 1 read data 1 write data 1 read flag 1 write flag 1 read data 0 write flag 1 read flag 0
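The "time cannot flow backwards" test can be sketched as a cycle search over a directed graph of accesses. The code below is an illustration with hypothetical names (CycleCheck, flagDataExample). It encodes one assumed execution of the data/flag program, not necessarily the one drawn on the slide: program-order edges within each processor, plus access-order edges derived from which values the reads observed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker: an execution violates sequential consistency
// if program order plus access order contains a cycle.
final class CycleCheck {
    private final List<List<Integer>> succ = new ArrayList<>();

    int addAccess() {
        succ.add(new ArrayList<>());
        return succ.size() - 1;
    }

    void addEdge(int from, int to) {
        succ.get(from).add(to);
    }

    boolean hasCycle() {
        int[] state = new int[succ.size()];   // 0 = unvisited, 1 = on DFS stack, 2 = done
        for (int v = 0; v < state.length; v++) {
            if (state[v] == 0 && dfs(v, state)) return true;
        }
        return false;
    }

    private boolean dfs(int v, int[] state) {
        state[v] = 1;
        for (int w : succ.get(v)) {
            if (state[w] == 1) return true;               // back edge closes a cycle
            if (state[w] == 0 && dfs(w, state)) return true;
        }
        state[v] = 2;
        return false;
    }

    // Proc A: write data 1; write flag 1.  Proc B: read flag (sees 1); read data.
    static boolean flagDataExample(boolean readSawOldData) {
        CycleCheck g = new CycleCheck();
        int writeData = g.addAccess();
        int writeFlag = g.addAccess();
        int readFlag = g.addAccess();
        int readData = g.addAccess();
        g.addEdge(writeData, writeFlag);   // program order on Proc A
        g.addEdge(readFlag, readData);     // program order on Proc B
        g.addEdge(writeFlag, readFlag);    // access order: the read saw flag == 1
        if (readSawOldData) {
            g.addEdge(readData, writeData);   // read saw 0, so it precedes the write
        } else {
            g.addEdge(writeData, readData);   // read saw 1, so it follows the write
        }
        return g.hasCycle();
    }

    public static void main(String[] args) {
        System.out.println(flagDataExample(true));
        System.out.println(flagDataExample(false));
    }
}
```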

Cycle Detection Generalization • Generalizes to arbitrary numbers of variables and processors • Cycles may be arbitrarily long, but it is sufficient to consider only minimal cycles with 1 or 2 consecutive stops per processor • Can simplify the analysis by assuming all processors run a copy of the same code (Figure labels: write x, write y, read y, read y, write x)

Static Analysis for Cycle Detection • Approximate P by the control flow graph • Approximate A by undirected “conflict” edges • Bi-directional edge between accesses to the same variable in which at least one is a write • It is still correct if the conflict edge set is a superset of the reality • Let the “delay set” D be all edges from P that are part of a minimal cycle • The execution order of D edges must be preserved; other P edges may be reordered (modulo usual rules about serial code) (Figure labels: read x, write z, write y, read y, write z)
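A conservative sketch of the delay-set computation follows. It makes several simplifying assumptions that are not in the slides: straight-line code on each processor (so P has no loops), conflict edges only between accesses on different processors, and "part of any cycle" instead of "part of a minimal cycle". The last choice can only add delay edges, so the result is a superset of the true delay set. All names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical conservative delay-set computation over accesses numbered 0..n-1.
final class DelaySet {
    private final List<int[]> programEdges = new ArrayList<>();  // directed u -> v
    private final List<List<Integer>> succ = new ArrayList<>();  // P edges plus both directions of conflicts

    DelaySet(int accesses) {
        for (int i = 0; i < accesses; i++) succ.add(new ArrayList<>());
    }

    void addProgramEdge(int u, int v) {
        programEdges.add(new int[] {u, v});
        succ.get(u).add(v);
    }

    void addConflictEdge(int u, int v) {   // same variable, at least one write
        succ.get(u).add(v);
        succ.get(v).add(u);
    }

    // A program edge u -> v is kept as a delay if v can get back to u,
    // that is, if the edge lies on some cycle of program and conflict edges.
    List<int[]> compute() {
        List<int[]> delays = new ArrayList<>();
        for (int[] e : programEdges) {
            if (reaches(e[1], e[0])) delays.add(e);
        }
        return delays;
    }

    private boolean reaches(int from, int target) {
        boolean[] seen = new boolean[succ.size()];
        Deque<Integer> work = new ArrayDeque<>();
        work.push(from);
        seen[from] = true;
        while (!work.isEmpty()) {
            int v = work.pop();
            if (v == target) return true;
            for (int w : succ.get(v)) {
                if (!seen[w]) {
                    seen[w] = true;
                    work.push(w);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // 0: A writes data, 1: A writes flag, 2: B reads flag, 3: B reads data
        DelaySet g = new DelaySet(4);
        g.addProgramEdge(0, 1);
        g.addProgramEdge(2, 3);
        g.addConflictEdge(0, 3);
        g.addConflictEdge(1, 2);
        System.out.println(g.compute().size() + " delay edges");
    }
}
```

For the data/flag program both program edges land in the delay set, so neither processor's pair of accesses may be reordered. If the second processor only read flag, no cycle would exist and the writes could be reordered freely.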

Cycle Detection in Practice • Cycle detection was implemented in a prototype version of the Split-C and Titanium compilers. • Split-C version used many simplifying assumptions. • Titanium version had too many conflict edges. • What is needed to make it practical? • Finding possibly-concurrent program blocks • Use SPMD model rather than threads to simplify • Or apply data race detection work for Java threads • Compute conflict edges • Need good alias analysis • Reduce size by separating shared/private variables • Synchronization analysis

Synchronization Analysis • Enrich language with synchronization primitives • Lock/Unlock or “synchronized” blocks • Post/Wait or Wait/Notify on condition variables • Global barriers: all processors wait at barrier • Compiler can exploit understanding of synchronization primitives to reduce cycles • Note: use of language primitives for synchronization may aid in optimization, but “rolling your own” is still correct

Edge Ordering • Post-Wait operations on the same variable can be ordered • Although correct to treat these as shared memory accesses, we can get leverage by ordering them • Then turn edges • ? → post c into delay edges • wait c → ? into delay edges • And orient the corresponding conflict edges (post c → wait c)

Edge Deletion • In SPMD programs, the most common form of synchronization is the global barrier • If we add to the delay set edges of the form • ? → barrier • barrier → ? • Then we can remove the corresponding conflict edges (those between accesses separated by the barrier)

Synchronization in Cycle Detection • Iterative algorithm • Compute delay set restrictions in which at least one operation is a synchronization operation • Perform edge orientation and deletion • Compute delay set on remaining conflict edges • Two important details • For locks (and synchronized) we need good alias information about the lock variables. (Conservative would probably work…) • For barriers, need to line up corresponding barriers

Static Analysis for Barriers • Lining up barriers is needed for cycle detection. • Mis-aligned barriers are also a source of bugs inside branches or loops. • Includes other global communication primitives: barrier, broadcast, reductions • Titanium uses barrier analysis, based on the idea of single variables and methods: • A “single” method is one called by all procs: public single static void allStep(...) • A “single” variable has the same value on all procs: int single timestep = 0;

Single Analysis • The underlying requirement is that barriers only match the same textual instance • Complication from conditionals: if (this processor owns some data) { compute on it; barrier } • Hence the use of “single” variables in Titanium • If a conditional or loop block contains a barrier, all processors must execute it • Expressions in such loop headers, if statements, etc. must contain only single variables

Single Variable Example in Titanium • Barriers and single in N-body Simulation
  class ParticleSim {
    public static void main (String [] argv) {
      int single allTimestep = 0;
      int single allEndTime = 100;
      for (; allTimestep < allEndTime; allTimestep++) {
        // read all particles and compute forces on mine
        computeForces(…);
        Ti.barrier();
        // write to my particles using new forces
        spreadParticles(…);
        Ti.barrier();
      }
    }
  }
• Single methods are automatically inferred, variables are not
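For readers without a Titanium compiler, a rough plain-Java analogue of the barrier-phased loop can be built with java.util.concurrent.CyclicBarrier. This is a sketch under assumptions: the class SpmdBarrierDemo, the one-particle-per-thread layout, and the toy force rule are invented for illustration, and nothing here enforces the "single" property. The loop bound is simply a parameter shared by every thread, so every thread reaches the same barriers the same number of times.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical SPMD-style loop: every thread runs the same code and
// meets the others at two barriers per time step.
final class SpmdBarrierDemo {
    static double[] run(int procs, int steps) {
        final double[] position = new double[procs];   // one toy "particle" per thread
        final double[] force = new double[procs];
        final CyclicBarrier barrier = new CyclicBarrier(procs);
        Thread[] threads = new Thread[procs];
        for (int p = 0; p < procs; p++) {
            final int me = p;
            threads[p] = new Thread(() -> {
                try {
                    for (int t = 0; t < steps; t++) {   // same trip count on every thread
                        // read all particles and compute the force on mine
                        double sum = 0.0;
                        for (double x : position) sum += x;
                        force[me] = 1.0 + sum;
                        barrier.await();                // nobody writes until all have read
                        // write to my particle using the new force
                        position[me] += force[me];
                        barrier.await();                // nobody reads until all have written
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            threads[p].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return position;
    }

    public static void main(String[] args) {
        double[] result = run(4, 2);
        System.out.println(result[0]);
    }
}
```

If one thread skipped a barrier, for instance because its loop bound differed, the others would wait forever; that is the class of bug the single analysis is designed to rule out at compile time.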

Some Open Problems • What is the right semantic model for shared memory parallel languages? • Is cycle detection practical on real languages? • How well can synchronization be analyzed? • Aliases between non-synchronizing variables? • Can we distinguish between shared and private data? • What is the complexity on real applications? • Analysis in programs with dynamic thread creation

Organization • Can we use Java for high performance on a • 1 processor machine? • 10-100 processor machine? • 1K-10K processor machine? • Programming model landscape • Global address space language support • Optimizing local pointers • Optimizing remote pointers • 100K-1M processor machine?

Programming Models at Scale • Large scale machines are mostly • Clusters of uniprocessors or SMPs • Some have hardware support for remote memory access • Shmem in Cray T3E • GM layer in Myrinet • DSM on SGI Origin 2K • Yet most programs are written in: • SPMD model • Message passing • Can we use a simpler, shared memory model? • On Origin, yes, but what about large machines?

Global Address Space • To run shared memory programs on distributed memory hardware, we replace references (pointers) by global ones: • May point to remote data • Useful in building large, complex data structures • Easy to port shared-memory programs (functionality is correct) • Uniform programming model across machines • Especially true for cluster of SMPs • Usual implementation • Each reference contains: • Processor id (or process id on cluster of SMPs) • And a memory address on that processor

Use of Global / Local • Global pointers are more expensive than local • When data is remote, a dereference turns into a remote read or write, which is a message call of some kind • When the data is not remote, there is still an overhead • space (processor number + memory address) • dereference time (check to see if local) • Conclusion: not all references should be global; use normal references when possible.
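The two overheads of a global pointer, extra space and a locality check on each dereference, can be made concrete with a small simulation. Everything below is a hypothetical model rather than Titanium's runtime: GlobalPtr pairs a processor id with an address, SimMachine holds one int heap per simulated processor, and a counter stands in for real messages.

```java
// Hypothetical global pointer: processor id plus an address on that processor.
final class GlobalPtr {
    final int proc;
    final int addr;
    GlobalPtr(int proc, int addr) {
        this.proc = proc;
        this.addr = addr;
    }
}

// Hypothetical machine model: one local heap per processor, messages only counted.
final class SimMachine {
    final int[][] heaps;
    int messages = 0;

    SimMachine(int procs, int heapSize) {
        heaps = new int[procs][heapSize];
    }

    // Dereference through a global pointer: always pays the locality check.
    int read(int me, GlobalPtr p) {
        if (p.proc == me) {
            return heaps[me][p.addr];    // local after all, but the check was still paid
        }
        messages++;                      // stands in for a remote read message
        return heaps[p.proc][p.addr];
    }

    void write(int me, GlobalPtr p, int value) {
        if (p.proc != me) {
            messages++;                  // stands in for a remote write message
        }
        heaps[p.proc][p.addr] = value;
    }

    // Dereference through a local pointer: no processor field, no check.
    int readLocal(int me, int addr) {
        return heaps[me][addr];
    }

    public static void main(String[] args) {
        SimMachine m = new SimMachine(2, 4);
        GlobalPtr p = new GlobalPtr(1, 2);
        m.write(1, p, 42);               // processor 1 writes its own heap
        System.out.println(m.read(0, p) + " after " + m.messages + " message(s)");
    }
}
```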

Explicit Approach to Global/Local • A common approach in parallel languages is to distinguish between local and global (“possibly remote”) pointers in the language. • Two variations are: • Make global the default: nice for porting shared memory programs • Make local the default: nice for calling libraries on a single processor that were built for uniprocessors • Titanium uses global default, with local declarations in important sections

Global Address Space • Processes allocate locally • References can be passed to other processes (Figure: Process 0 and the other processes, each with a LOCAL HEAP; lv marks local pointers, gv marks global pointers)
  class C { int val; ... }
  C gv;        // global pointer
  C local lv;  // local pointer
  if (thisProc() == 0) {
    lv = new C();
  }
  gv = broadcast lv from 0;
  gv.val = ...;
  ... = gv.val;

Local Pointer Analysis • Compiler can infer locals using Local Qualification Inference • Data structures must be well partitioned

Remote Accesses • What about remote accesses? In this case, the cost of the storage and extra check is small relative to the message cost. • Strategies for reducing remote accesses: • Use non-blocking writes: do not wait for them to be performed • Use prefetching for reads: ask before data is needed • Aggregate several accesses to the same processor together • All of these involve reordering or the potential for reordering

Communication Optimizations • Data on an old machine, UCB NOW, using a simple subset of C (Figure: normalized time)

Example communication costs • latency (a) and bandwidth (b) measured in units of flops • b measured per 8-byte word
  Machine               Year   a       b     Mflop rate per proc
  CM-5                  1992   1900    20    20
  IBM SP-1              1993   5000    32    100
  Intel Paragon         1994   1500    2.3   50
  IBM SP-2              1994   7000    40    200
  Cray T3D (PVM)        1994   1974    28    94
  UCB NOW               1996   2880    38    180
  UCB Millennium        2000   50000   300   500
  SGI Power Challenge   1995   3080    39    308
  SUN E6000             1996   1980    9     180
  SGI Origin 2K         2000   5000    25    500
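To see why aggregation matters with numbers like these, the sketch below assumes a simple linear cost model that the slides do not state explicitly: one message carrying n words costs a + n * b flops. The class CommCost and its methods are hypothetical; the a and b values in main are the UCB NOW row of the table.

```java
// Hypothetical cost model: a message of n words costs a + n * b (in flops).
final class CommCost {
    static double messageCost(double a, double b, int words) {
        return a + words * b;
    }

    // Sending each word in its own message pays the latency a every time.
    static double separate(double a, double b, int words) {
        return words * messageCost(a, b, 1);
    }

    // Sending all words in one message pays the latency a once.
    static double aggregated(double a, double b, int words) {
        return messageCost(a, b, words);
    }

    public static void main(String[] args) {
        double a = 2880, b = 38;   // UCB NOW (1996) from the table
        System.out.println("100 one-word messages: " + separate(a, b, 100) + " flops");
        System.out.println("one 100-word message:  " + aggregated(a, b, 100) + " flops");
    }
}
```

Under this assumed model, 100 separate one-word messages on UCB NOW cost 100 * (2880 + 38) = 291800 flops, while a single aggregated message costs 2880 + 100 * 38 = 6680 flops.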

Organization • Can we use Java for high performance on a • 1 processor machine? • 10-100 processor machine? • 1K-10K processor machine? • 100K-1M processor machine? • Kinds of machines • Open problems
